ICLShield: Exploring and Mitigating In-Context Learning Backdoor Attacks

Published: 01 May 2025, Last Modified: 23 Jul 2025 · ICML 2025 poster · CC BY 4.0
Abstract: In-context learning (ICL) has demonstrated remarkable success in large language models (LLMs) due to its adaptability and parameter-free nature. However, it also introduces a critical vulnerability to backdoor attacks, where adversaries can manipulate LLM behaviors by simply poisoning a few ICL demonstrations. In this paper, we propose, for the first time, the dual-learning hypothesis, which posits that LLMs simultaneously learn both the task-relevant latent concepts and backdoor latent concepts within poisoned demonstrations, jointly influencing the probability of model outputs. Through theoretical analysis, we derive an upper bound for ICL backdoor effects, revealing that the vulnerability is dominated by the concept preference ratio between the task and the backdoor. Motivated by these findings, we propose ICLShield, a defense mechanism that dynamically adjusts the concept preference ratio. Our method encourages LLMs to select clean demonstrations during the ICL phase by leveraging confidence and similarity scores, effectively mitigating susceptibility to backdoor attacks. Extensive experiments across multiple LLMs and tasks demonstrate that our method achieves state-of-the-art defense effectiveness, significantly outperforming existing approaches (+26.02\% on average). Furthermore, our method exhibits exceptional adaptability and defensive performance even for closed-source models (*e.g.*, GPT-4).
Lay Summary: Large language models (LLMs) like ChatGPT can learn new tasks from just a few examples shown in the prompt — this is called in-context learning (ICL). However, this flexibility creates a security vulnerability known as the ICL backdoor attack: attackers can manipulate the model's behavior by adding malicious examples to these demonstrations. We discovered that when processing these poisoned demonstrations, LLMs simultaneously learn two things: the task-relevant latent concepts and the backdoor latent concepts. The model's final behavior depends on which learning signal is stronger — like a competition between good and bad influences. Through theoretical analysis, we found that this vulnerability depends on the balance between task-relevant and backdoor concepts. Based on this insight, we developed ICLShield, a defense method that selects and adds clean demonstrations using confidence and similarity measures. Our method achieves state-of-the-art protection, outperforming existing defenses by 26% on average, and works even with closed-source models like GPT-4, significantly improving AI safety against ICL backdoor attacks.
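To make the defense idea concrete, here is a minimal sketch (not the authors' implementation) of selecting clean demonstrations by combining a confidence score with a similarity score to the query, as the abstract describes. The helper names (`select_clean_demos`, the `confidences` values standing in for the LLM's probability of each demonstration's own label, and the toy embeddings) are illustrative assumptions, not details from the paper.

```python
# Minimal sketch: rank candidate demonstrations by a weighted sum of
# (i) a confidence score and (ii) cosine similarity to the query embedding,
# then keep the top-k as "clean" demonstrations to prepend to the ICL prompt.
# The embeddings and confidence values below are placeholders for outputs of
# an embedding model and the target LLM, respectively (assumed, not specified here).

import numpy as np

def cosine(u, v):
    # Cosine similarity with a small epsilon to avoid division by zero.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def select_clean_demos(candidates, query_emb, demo_embs, confidences, k=4, alpha=0.5):
    """Score each candidate as alpha * confidence + (1 - alpha) * similarity,
    and return the k highest-scoring demonstrations."""
    scores = [
        alpha * conf + (1.0 - alpha) * cosine(emb, query_emb)
        for emb, conf in zip(demo_embs, confidences)
    ]
    top = np.argsort(scores)[::-1][:k]
    return [candidates[i] for i in top]

# Toy usage with random embeddings and made-up confidence scores.
rng = np.random.default_rng(0)
candidates = [f"demo_{i}" for i in range(10)]
demo_embs = rng.normal(size=(10, 16))
query_emb = rng.normal(size=16)
confidences = rng.uniform(size=10)  # e.g., p(label | demo) from the LLM
print(select_clean_demos(candidates, query_emb, demo_embs, confidences, k=4))
```

The weighting factor `alpha` is an assumed knob for trading off the two scores; the paper's actual selection rule and scoring details are given in the full text, not on this page.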
Primary Area: Deep Learning->Robustness
Keywords: In-context learning backdoor attack, backdoor defense, large language models
Submission Number: 5403