Mitigating adversarial manipulation in LLMs: a prompt-based approach to counter Jailbreak attacks (Prompt-G)

PeerJ Computer Science

Introduction

In recent times, large language models (LLMs) such as Llama (Meta) (Touvron et al., 2023) have achieved significant popularity, permeating diverse fields ranging from virtual assistants and chatbots to content generation and sentiment analysis. LLMs have democratized access to advanced natural language processing (NLP) capabilities, enabling developers and organizations to build applications that leverage the power of natural language understanding. LLMs employ architectures such as transformers to analyze and generate text with remarkable fluency and coherence. Figure 1 illustrates the mechanism of an LLM. Auto-regressive transformers undergo pretraining on a large corpus of self-supervised data and are subsequently fine-tuned to align with human preferences using methods like Reinforcement Learning from Human Feedback (RLHF) (Touvron et al., 2023; Alsentzer et al., 2019).

Figure 1: Mechanism of LLMs.

According to OWASP (a non-profit organization with expertise in cybersecurity that aims to protect web applications from cyber-attacks), the topmost threat to LLMs is currently “Prompt Injection”. Jailbreaking is the class of attacks that attempts to subvert the safety filters built into the LLMs themselves. With the increasing popularity of LLMs, advances in adversarial prompts, known as jailbreaks, exploit vulnerabilities in the architecture and implementation of LLMs (Hasan, Rugina & Wang, 2024). Specific queries can prompt LLM chat models to produce unsuitable content, as exemplified by the well-known case of instructing chat models with “Do Anything Now (DAN)” (Shen et al., 2023). An increasing number of prompts with similar effects have been identified. While numerous jailbreak methods continue to surface, there is currently a notable absence of a structured, exhaustive, and fair evaluation framework for these techniques. There has been limited exploration of the security of LLMs on a broad scale, so it is crucial to undertake a thorough investigation to uncover these vulnerabilities. Das, Amini & Wu (2024) provide a comprehensive examination of the protection and confidentiality concerns of LLMs, along with strategies for defense. The phenomenon of hallucinations in LLMs, as highlighted by Huang et al. (2023), presents a significant obstacle, casting doubt on the credibility of their outputs. Frequently, LLMs produce information that appears convincing but is either factually incorrect or nonsensical, as observed by Dai et al. (2023).

Self-reminder jailbreak attacks manipulate prompts to exploit a model’s tendency to follow instructions literally, leading to unintended or inappropriate output. These attacks craft seemingly harmless prompts that can be misinterpreted or used maliciously. Prompt-G specifically addresses this by distinguishing between harmful and benign prompts, preventing the model from generating inappropriate content and maintaining its integrity. It ensures the model understands context and intent, safeguarding against subtle manipulations. Vector databases use advanced indexing algorithms to organize and store high-dimensional vectors that represent jailbreak prompts. These vectors capture semantic relationships, enabling efficient and scalable retrieval of similar terms. Techniques like locality-sensitive hashing (LSH) or tree-based indexing allow for fast and accurate querying of large text corpora (Shen et al., 2023).

Embeddings are low-dimensional representations of words or phrases learned through neural network-based models, such as BERT (Xie et al., 2024). These embeddings encode semantic and syntactic information about words by capturing their contextual usage in each corpus.

Together, vector databases and embeddings provide powerful tools for analyzing and processing textual data. Their integration enables a wide range of tasks, including information retrieval. Vector databases facilitate the efficient storage and retrieval of embeddings, which accelerates inference and improves the model’s responsiveness, particularly in applications requiring real-time text generation. Additionally, embeddings provide LLMs with rich semantic representations of words and phrases, enabling them to capture subtle nuances in meaning and context. This improves the model’s capability to produce logical and context-relevant responses (Jing et al., 2024).

Our approach harnesses the integrated functionalities of vector databases and embeddings, combined with a repository of identified jailbreak attacks, to mitigate the generation of inappropriate output.

Goal conflict (Wei, Haghtalab & Steinhardt, 2024) arises when the pursuit of one goal compromises the achievement of the other. In the case of LLM chat models, this conflict manifests when producing a safe response may result in sacrificing the accuracy of the response, or vice versa.

  1. Safe response: Ensuring that LLM chat models generate safe and ethical responses is essential to avoid harmful or offensive content. This is particularly important in user-facing applications like customer service bots or educational platforms, where maintaining a positive user experience and upholding ethical standards are key.

  2. Accurate response: Producing accurate responses requires generating contextually relevant, coherent, and informative content. Accuracy is crucial for meaningful communication, involving a clear understanding of input and context to effectively address user queries.

Two types of scenarios are outlined:

Scenario 1: Sacrificing safety for accuracy: If the LLM prioritizes accuracy over safety, it may generate responses that are factually correct but contain inappropriate or harmful content. This could occur if the model lacks robust mechanisms for filtering out toxic or offensive language.

Scenario 2: Sacrificing accuracy for safety: Conversely, if the LLM prioritizes safety over accuracy, it may err on the side of caution and produce generic or evasive responses that lack depth or relevance. This could happen if the model is overly conservative in its response generation, avoiding certain topics or phrases to minimize the risk of generating harmful content.

Balancing these conflicting goals is challenging (Sinaga & Yang, 2020) but essential for developing responsible and effective LLM chat models. Strategies for mitigating goal conflict may involve implementing robust filtering mechanisms to ensure safety without compromising accuracy, or adopting context-aware approaches that prioritize safety while maintaining relevance and coherence in responses. Ultimately, finding the optimal balance between producing safe and accurate responses is crucial for enhancing the usability, reliability, and ethical integrity of LLM chat models.

Objective

The objective is to identify and prevent jailbreaking attacks that exploit system vulnerabilities to gain unauthorized access or manipulate functionality, ensuring system integrity and security. This includes guiding the generation of responses by LLMs to adhere to ethical principles, societal norms, and legal regulations, promoting responsible and principled application of artificial intelligence technologies. This approach will reduce the detrimental effects of data poisoning attacks, which attempt to manipulate training data to undermine the learning process’s integrity and effectiveness. By safeguarding the model against such malicious manipulation, this method ensures the accuracy and reliability of its outputs.

Contributions

The key contributions of our article include the following:

  • It proposes a unique model, Prompt-G, which identifies harmful user prompts and system prompts.

  • It examines various types of system prompts that, when paired with harmful questions, contribute to executing jailbreak attacks.

  • It integrates with the LLM to reduce the number of unintended responses.

  • Finally, it generates 300 responses by combining four user prompts with 75 system prompts for further analysis.

Organization

The rest of this article is organized as follows. “Related Works” provides an overview of the related works. “System Architecture Information” delves into the essential components required for our work. “Proposed Framework” details the proposed framework descriptions. “Implementation” details the implementation aspects of our work, including the specific algorithms employed. We delve into the practical realization of our proposed approach, outlining the chosen algorithms and explaining their role in achieving the desired outcomes. “Results and Discussion” presents the results of our work using a combination of graphical representations and statistical analyses to effectively communicate our findings. “Challenges and Future Work” explores potential avenues for future research. Finally, “Conclusion” concludes our work by summarizing the key findings.

Related works

Since the introduction of language models such as ChatGPT and Llama, numerous vulnerabilities have been identified. The articles below analyze the landscape of jailbreak attacks, outlining strategies for both identification and mitigation. Additionally, they delve into the inherent vulnerabilities of LLMs and propose techniques to overcome these weaknesses, ultimately aiming to strengthen the security and reliability of LLMs.

Jain et al. (2023) assessed multiple foundational defense approaches against prominent adversarial attacks targeting LLMs, exploring their applicability and efficacy across different scenarios. Their investigation revealed that the limitations of current discrete optimizers when applied to text, along with the considerable expenses associated with optimization, render conventional adaptive attacks more difficult to execute on LLMs.

Kim, Yuk & Cho (2024) followed a procedure in which the language model continuously evaluated and enhanced its responses autonomously, as observed in constitutional AI. They proposed leveraging the language model’s inherent self-refinement capabilities directly.

To mitigate jailbreaking assaults, Zhang et al. (2023) advocated for the incorporation of goal prioritization during both training and inference phases. Integrating goal prioritization during inference markedly reduces the attack success rate (ASR) of jailbreaking attempts.

Zhou et al. (2024) adopted an iterative methodology to refine both defensive and offensive agents. This iterative refinement process enhanced defenses against newly formulated jailbreak prompts, ensuring continual improvement. They employed agents learning to orchestrate an adversarial game.

Jin et al. (2024) adopted a unique yet intuitive approach for generating jailbreaks inspired by human-like generation. They gathered pre-existing jailbreak instances and segmented them based on distinct attributes using clustering methods that analyzed both frequency and semantic patterns at the sentence level.

Xi et al. (2024) proposed masking-differential prompting (MDP), which is an innovative, lightweight, and adaptable defense strategy designed for Pre-trained language models (PLMs) operating as few-shot learners. It took advantage of the discrepancy in sensitivity between poisoned and clean samples. By employing the limited few-shot data, it examined sample representations across various masking scenarios to identify poisoned samples that exhibited significant deviations.

Wei, Haghtalab & Steinhardt (2024) proposed potential conceptual failure modes inherent in LLMs safety training and illustrated how these insights can inform the development of efficient jailbreak attacks.

Xu et al. (2024) thoroughly scrutinized defense approaches and attack strategies implemented on various LLMs, with the objective of assessing the efficacy of these attack and defense methodologies. Conventional white-box attacks exhibited lower performance compared to universal techniques, and the inclusion of specialized tokens in the input substantially influenced the success rate of attacks.

Through extensive experimentation, Chu et al. (2024) consistently observed that optimization in jailbreak attacks achieved better rates of attack success and demonstrated resilience across various LLMs. Furthermore, they delved into the balance between attack effectiveness and efficiency, illustrating the continued viability of jailbreak prompt transferability, particularly in the context of black-box models.

Robey et al. (2023) employed SmoothLLM for addressing adversarial-prompting-based jailbreak attempts. The approach primarily centered on generating variations of a specific prompt by making slight modifications. Subsequently, the diverse responses generated for each modified version are aggregated and analyzed.

These referenced studies investigated the vulnerabilities of LLMs and the ways they can be exploited. These insights have significantly informed and guided our approach to addressing these issues in our research. Table 1 shows related research on preventing jailbreak attacks against LLMs.

Table 1:
Related research on preventing jailbreak attacks against LLMs.
Author(s) Title Keywords Methods
Jain et al. (2023) Baseline defences for adversarial attacks against aligned language models Adversarial attacks, detection, input preprocessing, paraphrasing Analysed LLMs’ defence methods; conventional attacks hindered by optimizer limitations
Robey et al. (2023) SmoothLLM: defending large language models against jailbreaking attacks SmoothLLM, query efficiency, conservatism SmoothLLM countered jailbreak attempts by generating prompt variations through slight modifications and analyzing aggregated responses.
Zhang et al. (2023) Defending large language models against jailbreaking attacks through goal prioritization Goal prioritization, inference stage, risk mitigation Implemented goal prioritization in training and inference phases to reduce jailbreaking Attack Success Rate (ASR).
Kim, Yuk & Cho (2024) Break the breakout: reinventing LM defense against jailbreak attacks with self-refinement Adversarial exploitation, self-refine, defense baselines Employed constitutional AI to enable LM to autonomously evaluate and enhance its responses continuously
Zhou et al. (2024) Defending jailbreak prompts via in-context adversarial game Fine tuning, adversarial training, empirical studies, versatile defence Iterative refinement strengthened defenses against jailbreak prompts using adversarial game
Jin et al. (2024) GUARD: Role-playing to generate natural-language jailbreakings to test guideline adherence of large language models Proactive testing, role-playing system, knowledge graph, empirical validation Utilized human-like generation, clustering pre-existing jailbreak instances based on frequency and semantic patterns at sentence level.
Xi et al. (2024) Defending pre-trained language models as few-shot Backdoor attacks, masking detection and prevention, lightweight defense, detection evasion MDP defended PLMs by leveraging differences in masking sensitivity, using limited few-shot data as benchmarks to identify tainted samples.
Wei, Haghtalab & Steinhardt (2024) JailBroken: how does LLM safety training fail Safety goals, red-teaming, scaling, Ad hoc jailbreaks Identified inherent conceptual failure modes in LLMs’ safety training and demonstrated how these inform the development of efficient jailbreak attacks.
Xu et al. (2024) LLM jailbreak attack vs. defense techniques: a comprehensive study Jailbreaking, white-box attacks, special tokens, testing framework Evaluated attack and defence strategies on different LLMs to gauge their effectiveness in various scenarios.
Chu et al. (2024) Comprehensive assessment of jailbreak attacks against LLMs Safeguards, role-playing, attack success rates Optimized jailbreak prompts achieved high attack success rates across various language models, emphasizing prompt transferability.
DOI: 10.7717/peerj-cs.2374/table-1

System architecture information

Dataset

To analyze various jailbreak techniques, we gathered 75 Self Reminder jailbreak attacks (https://github.com/yjw1029/Self-Reminder-Data/blob/master/data/jailbreak_prompts.csv). Each of these system prompts was paired with four questions, enabling a comprehensive evaluation of the effectiveness of each jailbreak prompt and allowing us to assess their impact across different scenarios and contexts.

For our model filter, we stored 666 jailbreak prompts (https://github.com/verazuo/jailbreak_llms/blob/main/data/prompts/jailbreak_prompts_2023_05_07.csv) and 390 harmful questions (https://github.com/verazuo/jailbreak_llms/blob/main/data/forbidden_question/forbidden_question_set.csv) in two separate vector databases, where similarity searches were conducted. During the preprocessing of the jailbreak prompt dataset, any mention of the chat model’s name was replaced with “Llama,” attributed to Meta.
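For illustration, a minimal preprocessing sketch using pandas is shown below; the column name prompt and the original model name being replaced are assumptions, since the dataset schema is not reproduced here.

```python
import pandas as pd

# Load the collected jailbreak prompts (the column name "prompt" is assumed).
jailbreak_df = pd.read_csv("jailbreak_prompts_2023_05_07.csv")

# Replace mentions of the original chat model's name with "Llama", attributed
# to Meta; the specific name replaced here is an illustrative assumption.
jailbreak_df["prompt"] = jailbreak_df["prompt"].str.replace(
    "ChatGPT", "Llama", regex=False
)

jailbreak_df.to_csv("jailbreak_prompts_preprocessed.csv", index=False)
```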

Large language models

In this work, we employ Meta’s LLM known as llama-2-13b-chat.ggmlv3.q6_K.bin. This LLM is distinguished by its quantized nature (Yao et al., 2024), utilizing the innovative new k-quant method. Employing GGML_TYPE_Q8_K for all tensors, it implements 6-bit quantization, ensuring compatibility with llama.cpp. Llama 2 represents an advancement over its predecessor, Llama 1, featuring updated training on a revised amalgamation of publicly accessible datasets. Llama 2 has parameters spanning 7, 13, and 70 billion, catering to diverse research needs and computational resources (Touvron et al., 2023).

This LLM model is characterized by several key parameters:

  • n_ctx: This parameter denotes the context window size, set at 2,048 tokens, indicating the maximum number of tokens that can be input to the model.

  • n_threads: Representing the number of CPU cores utilized during inference, this parameter is crucial for accelerating the computationally intensive generation process of LLMs. Leveraging multiple CPU cores enhances inference efficiency.

  • n_batch: Referring to the batch size employed when the model evaluates a prompt, this parameter determines the number of tokens processed together in each batch. Larger batch sizes can optimize GPU utilization and expedite processing; however, they also demand more GPU memory.

  • n_gpu_layers: This parameter specifies the number of layers (or blocks) within the model offloaded to the GPU. In scenarios where the GPU’s VRAM is limited, not all layers may fit simultaneously. Thus, n_gpu_layers determines the number of layers retained on the GPU, with the remaining layers processed on the CPU.

The present research was carried out on a Jupyter Notebook. The LLM generated responses using an NVIDIA T4 GPU with 15 GB of VRAM. The system had 12.7 GB of RAM and a disk size of 78.2 GB.
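A minimal sketch of loading this quantized model with the llama-cpp-python bindings is given below; apart from n_ctx, the parameter values are illustrative assumptions and should be tuned to the available GPU VRAM and CPU cores.

```python
from llama_cpp import Llama

# Load the 6-bit k-quant GGML build of Llama 2 13B chat.
llm = Llama(
    model_path="llama-2-13b-chat.ggmlv3.q6_K.bin",
    n_ctx=2048,       # context window: maximum number of input tokens
    n_threads=2,      # CPU cores used for layers kept on the CPU (assumed value)
    n_batch=512,      # tokens processed per batch during prompt evaluation (assumed value)
    n_gpu_layers=40,  # transformer layers offloaded to the GPU (assumed value)
)

# Simple completion call; the prompt is illustrative.
output = llm("Q: What is prompt injection? A:", max_tokens=128, temperature=0.7)
print(output["choices"][0]["text"])
```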

Embedding models

The Python library known as sentence-transformers serves as a potent asset for generating dense vector representations of textual sentences, known as sentence embeddings. Figure 2 demonstrates the operation of an embedding model. These embeddings excel in capturing semantic similarity and contextual nuances, rendering them invaluable for a broad spectrum of natural language processing endeavors including semantic search, clustering, and classification tasks. By leveraging sentence-transformers, users can efficiently map sentences into high-dimensional vector spaces, facilitating seamless similarity comparisons and subsequent downstream analyses. Renowned for its user-friendly interface, adaptability, and efficacy, sentence-transformers has garnered substantial popularity among both researchers and practitioners in the realm of text-based applications. One notable model within this library is all-mpnet-base-v2, which transforms sentences into a vector space with 768 dimensions.

Figure 2: Working procedure of embedding model.

The all-mpnet-base-v2 model has a context window spanning 384 tokens and produces embeddings with a dimensionality of 768 values. Its functionality involves accepting a list of strings as input and generating a corresponding list of embeddings, where each embedding consists of 768 floating-point numbers representing the semantic text embedding of the respective string. String inputs are limited to a maximum length of 384 tokens, roughly equivalent to 280 words; any strings exceeding this threshold are truncated before being processed by the embedding model.

Our model utilizes this embedding model to transform the dataset of known jailbreak attacks (specifically, the system prompts that lead the LLM to generate malicious responses) into mathematical representations, or vectors. Additionally, the embedding model converts the dataset of harmful questions into vector form. These vectors effectively capture and preserve the semantic meaning of words, sentences, and paragraphs.
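The snippet below sketches how the harmful questions could be embedded with sentence-transformers; the example strings are placeholders rather than items from the actual dataset.

```python
from sentence_transformers import SentenceTransformer

# Load all-mpnet-base-v2, which maps each sentence to a 768-dimensional vector.
embedder = SentenceTransformer("all-mpnet-base-v2")

harmful_questions = [
    "<harmful question 1>",  # placeholders for entries from the harmful question set
    "<harmful question 2>",
]

# Inputs longer than 384 tokens are truncated before embedding.
embeddings = embedder.encode(harmful_questions)
print(embeddings.shape)  # (2, 768)
```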

Vector store

Vector stores are specialized databases crafted specifically for the efficient storage and retrieval of vector embeddings. Embeddings serve as numerical representations of data, typically unstructured data like text, within high-dimensional vector spaces. Traditional relational databases are ill-equipped to manage the storage and querying of these vector representations effectively. Vector stores excel in indexing and swiftly searching for similar vectors using dedicated similarity algorithms. This capability enables applications to identify related vectors based on a provided target vector query efficiently.

Chroma DB serves as an open-source vector store tailored for the storage and retrieval of vector embeddings. Its primary function revolves around preserving embeddings alongside associated metadata for future utilization by LLMs. Moreover, it extends its utility to powering semantic search engines for text data. The platform supplies a Python SDK that emphasizes simplicity, speed, and ease of analysis. Furthermore, Chroma DB presents a self-hosted server option for enhanced control and flexibility. Figure 3 demonstrates the working of the vector store Chroma DB.

Figure 3: Working model of vector store Chroma DB.

The vector store ChromaDB utilizes the embedding model all-mpnet-base-v2 to store vector representations of known jailbreak attacks and harmful questions generated by the embedding model. ChromaDB employs cosine distance to measure the similarity between these vectors, facilitating the identification of similar jailbreak attacks and determining whether the responses generated by them are alike.

To assess the similarity between the responses generated from the 75 system prompts and the responses generated when DAN was used as the system prompt, we employed the Chroma retriever, which uses cosine distance as its similarity measure. Cosine distance, calculated as 1 − cosine similarity, is closely related to cosine similarity.

$$\mathrm{cosine\_similarity}(a, b) = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert} \tag{1}$$

$$\mathrm{cosine\_distance} = 1 - \mathrm{cosine\_similarity} \tag{2}$$

Equations (1) and (2) demonstrate the calculations for cosine similarity and cosine distance, respectively. In these equations, vectors a and b represent the entities for which similarity is being measured. A lower cosine distance value (close to 0) indicates a high similarity between vectors, suggesting that the vectors are aligned in the same or nearly the same direction. Conversely, a higher cosine distance value (close to 2) indicates a low similarity, meaning the vectors are pointing in opposite directions and are therefore very different from each other.
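As a small worked example of Eqs. (1) and (2), using two arbitrary illustrative vectors:

```python
import numpy as np

a = np.array([1.0, 0.0, 1.0])
b = np.array([0.5, 0.5, 1.0])

# Eq. (1): cosine similarity = (a . b) / (||a|| ||b||)
cosine_similarity = a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Eq. (2): cosine distance = 1 - cosine similarity
cosine_distance = 1 - cosine_similarity

print(round(cosine_similarity, 3), round(cosine_distance, 3))  # 0.866 0.134
```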

Toxicity analyzer

We employ HuggingFace’s martin-ha/toxic-comment-model (Balestriero, Cosentino & Shekkizhar, 2023) toxicity analyzer to determine the toxicity score of LLM-generated responses. This model specializes in classifying responses as either toxic or non-toxic. By inputting a string into the model, it assesses whether the text is toxic or non-toxic. This facilitates the analysis of responses and enables the filtration of toxic content. This model returns a label of either “toxic” or “non-toxic.” If the toxicity score is higher than the non-toxicity score, the label “toxic” is returned; otherwise, the label “non-toxic” is assigned.

The toxicity analyzer plays a crucial role in assessing whether the generated responses contain any potentially harmful or offensive content. Its primary function is to identify and filter out language that could negatively impact the sentiments of various communities. Our AI model is specifically designed to produce responses that are neutral and unbiased, ensuring that it does not favor or discriminate against any particular group of people. By utilizing the toxicity analyzer, we aim to maintain a high standard of inclusivity and respect in the interactions generated by the model, thereby fostering a safe and welcoming environment for all users.
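A minimal sketch of this toxicity check with the Hugging Face transformers pipeline is shown below; the example response string is illustrative.

```python
from transformers import pipeline

# Load martin-ha/toxic-comment-model as a text-classification pipeline.
toxicity_analyzer = pipeline("text-classification", model="martin-ha/toxic-comment-model")

response = "I'm sorry, but I can't help with that request."  # illustrative LLM output
result = toxicity_analyzer(response)[0]

# The pipeline returns the winning label ("toxic" or "non-toxic") with its score.
if result["label"] == "toxic":
    print("Toxic response detected; the LLM is asked to regenerate a safe answer.")
else:
    print(f"Response accepted ({result['label']}, score={result['score']:.2f}).")
```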

Transformers

Transformer architecture, a form of artificial neural network (ANN), gains an understanding of contextual significance by examining affiliations within continuous data. These architectures can identify nuanced correlations among distant elements in a sequence, rendering them adaptable for a range of tasks across sequential data types.

Initially trained on extensive datasets, transformers produce precise predictions, driving their widespread acceptance and enabling the development of even more sophisticated models. As a result, transformers have started to replace convolutional and recurrent neural networks (CNNs and RNNs), which were the dominant deep learning models just five years ago (Liu et al., 2024). Prior to the emergence of transformers, training neural networks demanded substantial labeled datasets, which were both expensive and time-intensive to obtain. Transformers circumvent this necessity by mathematically identifying patterns between elements, thereby harnessing the wealth of images and text available on the internet.

Transformers comprise extensive encoder/decoder blocks. Attention queries are typically conducted simultaneously via multi-headed attention, where matrices of equations are computed in parallel. Transformers empower LLMs to glean insights from extensive textual data, enabling continual enhancement of their language comprehension and generation capabilities. Through the utilization of pre-trained transformer models and subsequent fine-tuning on conversational datasets, LLMs can adeptly adjust to diverse domains and communication styles. Transformers equip LLMs with the ability to grasp the intricate connections between words and phrases within conversational contexts. This is accomplished through mechanisms like attention, which enables the model to concentrate on pertinent segments of the input text while formulating responses. Furthermore, transformers facilitate the modeling of extended dependencies, enabling LLMs to uphold context and coherence throughout prolonged dialogues.

Proposed framework

The dataset comprises 666 established jailbreak prompts and 390 identified harmful questions. These entries, encompassing known jailbreak attacks, are incorporated into the vector store, Chroma DB. Its core purpose revolves around retaining embeddings along with pertinent metadata to aid future utilization by large language models. Additionally, Chroma DB serves to drive semantic search engines for textual data. Figure 4 presents our proposed framework of the Prompt-G filter.

Figure 4: Overall working strategy of prompt-G framework.

Chroma DB stores jailbreak prompts and harmful questions as embeddings that preserve their meaning effectively. To accomplish this, Chroma DB employs the all-mpnet-base-v2 model through the Sentence Transformers embedding function to convert prompts into embeddings, ensuring the retention of their contextual meaning and facilitating efficient retrieval and analysis.
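A sketch of how the two collections could be set up in Chroma is given below; the collection names, placeholder documents, and the explicit cosine-space setting are assumptions made for illustration.

```python
import chromadb
from chromadb.utils import embedding_functions

client = chromadb.Client()

# Use all-mpnet-base-v2 through Chroma's sentence-transformers embedding function.
embed_fn = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-mpnet-base-v2"
)

# In practice these lists hold the 666 jailbreak prompts and 390 harmful
# questions loaded from the CSV files; short placeholders are used here.
jailbreak_prompts = ["<known jailbreak prompt 1>", "<known jailbreak prompt 2>"]
harmful_questions = ["<harmful question 1>", "<harmful question 2>"]

# One collection per dataset, configured to use cosine distance.
jailbreak_collection = client.create_collection(
    name="jailbreak_prompts",
    embedding_function=embed_fn,
    metadata={"hnsw:space": "cosine"},
)
question_collection = client.create_collection(
    name="harmful_questions",
    embedding_function=embed_fn,
    metadata={"hnsw:space": "cosine"},
)

jailbreak_collection.add(documents=jailbreak_prompts,
                         ids=[f"jb-{i}" for i in range(len(jailbreak_prompts))])
question_collection.add(documents=harmful_questions,
                        ids=[f"hq-{i}" for i in range(len(harmful_questions))])

# A query returns the nearest stored entries together with their cosine distances;
# a distance close to 0 means the input closely matches a known jailbreak prompt.
hit = jailbreak_collection.query(query_texts=["<incoming system prompt>"], n_results=1)
print(hit["distances"][0][0])
```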

When a user inputs a prompt, it consists of three integral parts that together compose the prompt template. The first part, known as the system prompt, furnishes context to the LLM, guiding it to respond within the specified context before addressing the user’s inquiry. A malicious system prompt can divert the attention of the LLM, leading to the generation of harmful responses (Tian et al., 2023). The second component, the user prompt, contains the actual question posed to the LLM. The LLM formulates its response to the user prompt within the context established by the system prompt. Lastly, the third section of the prompt template houses the LLM’s response. By inputting both the system prompt and the user prompt, the entire prompt template is generated, from which the response located in the third part is extracted.
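Assuming the standard Llama 2 chat format, the three-part template described above can be sketched as follows; the delimiter tokens are those commonly used with Llama 2 chat models, and the example prompts are illustrative.

```python
def build_prompt(system_prompt: str, user_prompt: str) -> str:
    """Combine the system and user prompts into a Llama 2 chat template.

    The model's completion, appended after [/INST], forms the third part of
    the template (the assistant's response), which is then extracted.
    """
    return (
        "<s>[INST] <<SYS>>\n"
        f"{system_prompt}\n"
        "<</SYS>>\n\n"
        f"{user_prompt} [/INST] "
    )

template = build_prompt(
    system_prompt="You are a helpful, respectful and honest assistant.",
    user_prompt="Explain what a prompt injection attack is.",
)
```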

Implementation

Analyzing various jailbreak attacks

We compiled a dataset consisting of 75 self-reminder jailbreak attacks, all serving as system prompts. For each of these 75 prompts, we paired four distinct questions, taken from the dataset at https://github.com/yjw1029/Self-Reminder-Data/blob/master/data/attack_prompt.json.

These questions were designed to analyze the LLM’s capability to generate responses related to fake news, phishing email templates, extraction of malicious information, and structured query language (SQL) injection scenarios. Consequently, a total of 75 × 4 = 300 responses were generated.

Furthermore, we subjected the same set of four questions to the “Do Anything Now (DAN 15.0)” (Shen et al., 2023) system prompt, renowned as one of the most potent system prompts to date. Figure 5 displays the scenario where the LLM is prompted to generate SQL injection, with DAN serving as the system prompt. This prompt has the ability to induce the LLM to generate responses that may be unsuitable or unintended. We subsequently evaluated the semantic similarity between the four responses produced by each of the 75 jailbreak attacks and the corresponding responses generated by the DAN jailbreak attack.

Figure 5: Generating SQL injection using DAN as the system prompt.

Similarly, we examined the semantic similarity between the 75 system prompts and the DAN system prompt. Following each cosine similarity assessment, the resulting values were recorded in a dataset. Subsequently, we constructed graphs where the x-axis represented the similarity between responses, while the y-axis denoted the similarity between prompts. Four separate graphs were created, each corresponding to one of the four different responses. While the y-axis remained consistent across all graphs, the x-axis varied, representing the similarity between responses for the four distinct questions.

Next, we employed K-means clustering with k = 2 to categorize the responses into two clusters based on their similarity scores. High similarity indicated that the respective system prompt was highly effective in eliciting unintended responses from the LLM. This clustering analysis facilitated the differentiation of responses with high similarity from those with low similarity, providing valuable insights into the efficacy of different system prompts in generating unintended responses from the LLM.
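A sketch of this clustering step with scikit-learn is shown below, assuming the recorded similarity scores are cosine distances (lower values mean higher similarity) collected into a DataFrame; the values and column names are illustrative.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Cosine distances of each prompt's response (and of the prompt itself) to the
# corresponding DAN response/prompt; lower values mean higher similarity.
scores = pd.DataFrame({
    "response_distance": [0.12, 0.15, 0.80, 0.78, 0.10],  # illustrative values
    "prompt_distance":   [0.30, 0.35, 0.70, 0.65, 0.25],
})

# k = 2: separate prompts whose responses closely match DAN's from those that do not.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
scores["cluster"] = kmeans.fit_predict(scores)

# The high-similarity cluster is the one with the lower mean response distance.
high_cluster = scores.groupby("cluster")["response_distance"].mean().idxmin()
share = (scores["cluster"] == high_cluster).mean() * 100
print(f"{share:.2f}% of prompts elicited DAN-like responses")
```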

From each of the four graphs, we extracted responses demonstrating high similarity and computed the percentage of high similarity responses. This methodology enabled us to evaluate the efficacy of the system prompts and assess the extent of vulnerability of the LLM to jailbreak attacks.

Algorithm for analyzing various jailbreak attacks

This algorithm as shown in Fig. 6 outlines a process for analyzing a dataset of jailbreak system prompts using a LLM.

Figure 6: Algorithm for analyzing various jailbreak attacks.

We begin by loading the data containing jailbreak system prompts, retrieved from a spreadsheet using the pandas library. Two lists are created: one to store the original jailbreak prompts and another to hold the four attack-related questions we want to ask about each prompt. An empty list is prepared to accumulate the results. The algorithm then loops through each jailbreak prompt in the dataset. For each prompt, another empty list is created to temporarily store the corresponding answers. We then loop through the four attack questions one by one. Inside this loop, the current jailbreak prompt and the current question are combined to form a complete prompt for the LLM. This constructed prompt is sent to the LLM along with the other necessary settings. The LLM’s response, which acts as the assistant’s answer, is extracted from the model and added to the temporary list holding responses for the current jailbreak prompt. Once all four questions are processed for a particular jailbreak prompt, the entire list of answers (one for each question) is added to the results list. The process repeats for the remaining jailbreak prompts in the dataset, building and collecting responses until all prompts are analyzed. This approach essentially uses the LLM to answer a set of standardized questions about each jailbreak system prompt, helping us to assess and understand the prompts better.
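A condensed Python sketch of this procedure is given below, assuming the llama-cpp-python bindings, a column named prompt, and placeholder attack questions; Figure 6 remains the authoritative description.

```python
import pandas as pd
from llama_cpp import Llama

llm = Llama(model_path="llama-2-13b-chat.ggmlv3.q6_K.bin", n_ctx=2048, n_gpu_layers=40)

# 75 self-reminder jailbreak system prompts (column name "prompt" is assumed).
jailbreak_prompts = pd.read_csv("jailbreak_prompts.csv")["prompt"].tolist()

# The four standardized attack questions (placeholders shown here).
attack_questions = [
    "<fake news question>",
    "<phishing email question>",
    "<malicious information question>",
    "<SQL injection question>",
]

results = []
for system_prompt in jailbreak_prompts:
    answers = []
    for question in attack_questions:
        # Combine the jailbreak system prompt and the attack question into the
        # Llama 2 chat template and query the model.
        prompt = f"<s>[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n{question} [/INST] "
        output = llm(prompt, max_tokens=512, temperature=0.7)
        answers.append(output["choices"][0]["text"])  # assistant's response
    results.append(answers)  # one list of four responses per jailbreak prompt
```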

Safe response generation

Prompt-G conducts a query against the vector store, which holds the various types of known jailbreak prompts and harmful questions, employing cosine distance as the metric of choice. The query yields a decimal value: a value closer to zero indicates greater semantic similarity between the two prompts, while a value closer to one signifies less similarity. This search is performed for both the system prompt and the user prompt.

Case 1: If similarity checks are performed on both the system prompt and user prompt, and the model identifies a significant similarity between harmful questions stored in the vector store and the user input prompt, it then assesses similarity between prompts in the vector store and the system input prompt. In such instances, the model disregards the input system prompt and adheres to the system prompt provided by us. Additionally, the temperature of the model is reduced to 0. The temperature parameter of the LLM influences its output, determining whether the output leans towards randomness and creativity or predictability. Lowering the temperature prompts the LLM to generate a response that prioritizes safety and ethical considerations.

Case 2: If a high similarity is identified between harmful questions in the vector store and the input user prompt, but there is a low similarity between prompts in the vector store and the input system prompt, no modification to the system prompt is deemed necessary, and the temperature of the LLM remains unchanged. In such scenarios, the LLM independently generates a response that is safe and ethical. This precaution is crucial because a system prompt that closely resembles known jailbreak prompts can distract the LLM, leading to the generation of unintended responses.

Therefore, the filtering mechanism ensures that the LLM produces responses that are safe and ethical by identifying and rejecting any responses that contain unintended, unsuitable, or malicious information. Should the model generate a response identified as toxic by the martin-ha/toxic-comment-model, the LLM receives instructions to generate an alternative response that is non-toxic. As a result of this approach, there was a notable reduction in the occurrence of unintended response generation by the LLM.

Whenever the LLM encountered a harmless question, no adjustments were applied to either the system prompt or the input user prompt. Consequently, these prompts remained unaltered and were not subjected to filtration by Prompt-G. Figure 7 demonstrates the collaborative operation of the Vector Store and Embeddings, illustrating how they function together to generate a response from the LLM.

Figure 7: Integration of vector stores and embedding models.

Algorithm for safe response generation

The algorithm as shown in Fig. 8 begins by loading datasets of jailbreak system prompts and harmful questions using the pandas library and creating vector databases to store them. Two functions, “check_question” and “check_prompt”, are defined to assess the similarity between input prompts and prompts in the respective databases. If high similarity is detected, these functions return 1; otherwise, they return 0. Another function, “query,” constructs a prompt template combining system, user, and assistant prompts, then uses the LCPP-LLM model to generate a response based on this template and a specified temperature. If the response is labeled as “toxic,” the LLM is queried again with a modified system prompt and the input question. Finally, depending on the outcomes of the “check_question” and “check_prompt” functions, the LLM is queried with either modified or original prompts, adjusting the temperature accordingly.

Figure 8: Algorithm for safe response generation.
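The control flow described above can be sketched as follows, reusing the llm, collection, and toxicity-analyzer objects from the earlier sketches; the distance threshold and the replacement safe system prompt are assumptions, and Figure 8 remains the authoritative specification.

```python
# Reuses llm, jailbreak_collection, question_collection and toxicity_analyzer
# from the earlier sketches.
THRESHOLD = 0.35  # assumed cosine-distance threshold; lower distance = more similar

SAFE_SYSTEM_PROMPT = (  # illustrative replacement system prompt
    "You are a helpful assistant. Refuse any request for harmful, unethical "
    "or illegal content and answer safely."
)

def check_question(user_prompt: str) -> int:
    """Return 1 if the user prompt closely matches a stored harmful question."""
    hit = question_collection.query(query_texts=[user_prompt], n_results=1)
    return int(hit["distances"][0][0] < THRESHOLD)

def check_prompt(system_prompt: str) -> int:
    """Return 1 if the system prompt closely matches a stored jailbreak prompt."""
    hit = jailbreak_collection.query(query_texts=[system_prompt], n_results=1)
    return int(hit["distances"][0][0] < THRESHOLD)

def query(system_prompt: str, user_prompt: str, temperature: float) -> str:
    """Build the prompt template, generate a response, and re-query if toxic."""
    prompt = f"<s>[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n{user_prompt} [/INST] "
    text = llm(prompt, max_tokens=512, temperature=temperature)["choices"][0]["text"]
    if toxicity_analyzer(text)[0]["label"] == "toxic":
        # Ask again with the safe system prompt and no creative freedom.
        safe = f"<s>[INST] <<SYS>>\n{SAFE_SYSTEM_PROMPT}\n<</SYS>>\n\n{user_prompt} [/INST] "
        text = llm(safe, max_tokens=512, temperature=0.0)["choices"][0]["text"]
    return text

def prompt_g(system_prompt: str, user_prompt: str) -> str:
    if check_question(user_prompt) and check_prompt(system_prompt):
        # Case 1: override the malicious system prompt and set temperature to 0.
        return query(SAFE_SYSTEM_PROMPT, user_prompt, temperature=0.0)
    # Case 2 and harmless questions: keep the original prompts and temperature.
    return query(system_prompt, user_prompt, temperature=0.7)
```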

Figures 9 and 10 depict the instances where the LLM is tasked with generating SQL injection and fake news, respectively, utilizing ARIA as the system prompt.

Figure 9: Generating SQL injection via ARIA system prompt in LLM.

Figure 10: Fake news generation using ARIA system prompt in LLM.

Results and discussion

Upon scrutinizing the jailbreak prompts and exposing the LLM to four distinct attack types, we were able to evaluate which forms of jailbreak attacks effectively induced the generation of inappropriate content.

Four graphs were plotted, each showing the similarity of responses to their corresponding DAN response plotted against the similarity of prompts to DAN. Figure 11A depicts fake news response similarity vs. prompt similarity. Figure 11B depicts phishing email response similarity vs. prompt similarity. Figure 11C depicts malicious information response similarity vs. prompt similarity. Figure 11D depicts SQL injection response similarity vs. prompt similarity. Numerous jailbreak attacks successfully prompted the LLM to produce fake news. However, for the other three types of attacks, a discernible pattern emerges from the data: a substantial proportion of jailbreak attacks resulted in the generation of unsuitable content, while only a minority failed to do so. It is imperative to note that the LLM accepted all jailbreak attacks that succeeded in generating unsuitable content.

Figure 11: (A) Fake news response similarity vs. prompt similarity. (B) Phishing email response similarity vs. prompt similarity. (C) Malicious information response similarity vs. prompt similarity. (D) SQL injection response similarity vs. prompt similarity.

To identify clusters associated with the successful generation of unsuitable content and those that are not, we applied K-means clustering with k = 2. Figure 12 presents the application of K-means clustering to different types of attack questions administered to the LLM. Figure 12A depicts K-means clustering for fake news response similarity vs. prompt similarity. Figure 12B depicts K-means clustering for phishing email response similarity vs. prompt similarity. Figure 12C depicts K-means clustering for malicious information response similarity vs. prompt similarity. Figure 12D depicts K-means clustering for SQL injection response similarity vs. prompt similarity. Our analysis showed that 77.33% of the initial 75 system prompts were identified as successful jailbreaks in generating fake news. Furthermore, the similarity in responses for phishing email generation was found to be 81.08%, while the similarity in responses for producing malicious content and SQL injection stood at 82.43% and 74.32%, respectively. Our study essentially evaluated how many system prompts produced responses that closely resembled those generated when DAN was used as the system prompt. Table 2 illustrates the similarity in responses of different question types.

Figure 12: (A) K-means clustering for fake news response similarity vs. prompt similarity. (B) K-means clustering for phishing email response similarity vs. prompt similarity. (C) K-means clustering for malicious information response similarity vs. prompt similarity. (D) K-means clustering for SQL injection response similarity vs. prompt similarity.

Table 2:
Comparison of response similarity across different question types and system prompts.
Types of attacks Total number of system prompts Number of similar responses Responses similar (%)
Fake news 75 58 77.33
Phishing email 75 60 81.08
Malicious content 75 61 82.43
SQL injection 75 55 74.32
DOI: 10.7717/peerj-cs.2374/table-2

Our model was evaluated using both harmful and harmless questions, and the results were analyzed using statistical metrics such as precision, recall, accuracy, and F1-score. Additionally, a confusion matrix was generated. Figure 13 displays the confusion matrix, from which precision, recall, F1-score, and accuracy were derived. Table 3 presents the LLM’s efficiency in detecting harmful questions.

Figure 13: Confusion matrix of harmful and harmless questions.

Table 3:
Efficacy of LLM in detecting harmful questions.
Types of question Precision Recall F1-Score Support
Harmless questions 0.96 0.94 0.95 50
Harmful questions 0.98 0.99 0.99 190
DOI: 10.7717/peerj-cs.2374/table-3

Precision indicates the proportion of instances that the model correctly identified as positive out of all instances it predicted as positive.

$$\mathrm{Precision} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Positive}}$$

Recall reflects the model’s ability to correctly identify actual positive instances.

$$\mathrm{Recall} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Negative}}$$

Accuracy represents the overall proportion of correct predictions, both positive and negative, relative to the total number of instances.

$$\mathrm{Accuracy} = \frac{\text{True Positive} + \text{True Negative}}{\text{True Positive} + \text{True Negative} + \text{False Positive} + \text{False Negative}}$$

The F1-score, which is the harmonic mean of precision and recall, is particularly valuable in cases of class imbalance, as it balances the trade-off between precision and recall.

$$F1\text{-}score = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$

During the testing phase, we exposed the LLM to a total of 190 harmful questions (Shen et al., 2023). The model’s filtering mechanism effectively identified 188 of these harmful questions, prompting modifications to the prompt template before presenting them to the LLM. Figure 14 illustrates the scenario where, upon detecting a harmful question, the original system prompt is not adhered to. Instead, a distinct system prompt is utilized. Consequently, the LLM generated responses that adhered to safety protocols, ensuring the absence of harmful or inappropriate content.

Figure 14: Override of the original system prompt when harmful system and user prompts are combined.

Additionally, we subjected the LLM to 50 harmless questions. Prompt-G successfully identified 47 of these questions as harmless. As a result, no alterations were made to the LLM, despite high similarity scores observed in the system prompt as shown in an example in Fig. 15.

Figure 15: The prompts remain unchanged when the questions are harmless.

The data presented in Table 3 indicates that Prompt-G demonstrated robustness by accurately identifying 98.95% of harmful questions, thereby facilitating the exclusion of these questions from the LLM’s input. However, 6% of the questions in a dataset composed of harmless questions were flagged as harmful. This discrepancy may arise from semantic confusion, leading the model to misclassify certain harmless questions as harmful.
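As a consistency check, the per-class metrics in Table 3 can be reproduced from the counts reported in this section (188 of 190 harmful questions and 47 of 50 harmless questions correctly identified), assuming the remaining items were assigned to the opposite class:

```python
from sklearn.metrics import classification_report

# Ground-truth labels: 50 harmless and 190 harmful questions.
y_true = ["harmless"] * 50 + ["harmful"] * 190
# Predictions implied by the reported counts: 47/50 harmless and 188/190 harmful
# questions identified correctly, the rest assigned to the opposite class.
y_pred = (["harmless"] * 47 + ["harmful"] * 3) + (["harmful"] * 188 + ["harmless"] * 2)

print(classification_report(y_true, y_pred, digits=2))
# harmless: precision 0.96, recall 0.94, F1 0.95; harmful: precision 0.98, recall 0.99, F1 0.99
```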

When the model identified a harmful user prompt in conjunction with a malicious system prompt, the temperature of the chat model was lowered to 0, eliminating the LLM’s creative freedom. This adjustment caused the chat model to produce a response that politely declined to provide any malicious information or reply.

The Prompt-G model was evaluated against other baseline methods, including vanilla supervised fine-tuning (SFT), aligned SFT, goal prioritization (Zhang et al., 2023), and self-reminder (Wu et al., 2023). The attack success rate (ASR) for our model was determined to be 2.08%. Since our model is primarily designed to detect harmful questions and prompts, and it misidentified only 2.08% of the questions, correctly identifying the remainder allows us to prevent the generation of malicious responses. Table 4 shows the comparison between various baseline models and our model.

Table 4:
Comparison between various baseline models.
Methods ASR (%)
Vanilla SFT (Zhang et al., 2023) 71.00
Aligned SFT (Zhang et al., 2023) 20.30
Goal prioritization (Zhang et al., 2023) 6.60
Self-reminder (Wu et al., 2023) 4.97
Prompt-G 2.08
DOI: 10.7717/peerj-cs.2374/table-4

In this article, the model was integrated with the quantized version of the Llama 2 13B chat model. This filter can be utilized with any LLM chat model to prevent the generation of malicious responses.

Challenges and future work

Despite achieving its primary objectives, the system holds promise for further exploration and improvement. Several avenues for future research could extend its capabilities and broaden its impact. These avenues could include:

  • Evaluate the generalizability of the framework: While the system is effective against Self Reminder attacks, it is crucial to assess how well it performs in identifying and mitigating a wider variety of attacks. This could involve testing the framework against other types of jailbreak techniques, such as Role Play and Prompt Injection attacks. Understanding its performance across different attack scenarios would help gauge the framework’s adaptability and robustness in real-world applications.

  • Investigate framework performance across diverse LLMs: The current evaluation is focused on a specific model, but understanding the framework’s efficacy across a broader range of LLM architectures is important. This would involve applying the framework to models like GPT, PaLM, or Bloom, and assessing its ability to adapt to their unique characteristics, such as varying levels of parameterization, different training data, and architecture-specific behaviors. A successful evaluation across diverse LLMs would demonstrate the framework’s versatility and potential for wide adoption.

  • Enhance the model filter by increasing the heterogeneity of known attacks and harmful questions: The model filter’s effectiveness depends heavily on the variety of known jailbreak attacks and harmful questions within its database. Expanding the diversity of these datasets would allow the model to detect a broader range of malicious prompts. By incorporating a wider spectrum of attack types and harmful queries, the system can be made more resilient, leading to improved detection accuracy and reduced false positives. As a result, the overall robustness of the model would be significantly enhanced.

By addressing these areas, the framework could become more versatile, resilient, and effective, ultimately contributing to the creation of safer and more reliable AI systems.

Conclusion

After a thorough examination of the jailbreak prompts, it was observed that a small subset of these prompts led the LLM to naturally refrain from producing harmful responses, thereby reducing the necessity for intervention by the Prompt-G filter in such cases. Nonetheless, a substantial majority of the jailbreak prompts still successfully induced unintended and potentially dangerous outputs, emphasizing the critical importance of deploying the Prompt-G filter to counteract such vulnerabilities. When these prompts were integrated with malicious user inputs and subsequently processed through the Prompt-G filter, the LLM consistently generated responses that adhered to safety and ethical guidelines. While the Prompt-G filter was notably effective in detecting prompts that introduced confusion or distraction into the LLM, it also exhibited a minor tendency to incorrectly flag benign prompts as harmful, albeit with a relatively low error rate. This trade-off highlights the balance between robust protection and maintaining response accuracy.

As the technological landscape continues to evolve, this system positions itself as a promising platform for further exploration and refinement. The evolving nature of jailbreak attacks necessitates continuous adaptation of LLMs. While our work demonstrates a promising approach, it’s crucial to acknowledge the growing sophistication of these attacks. To maintain efficacy, LLMs must undergo regular updates and incorporate advancements in detection and mitigation strategies.
