Exploring the Collaborative Advantage of Low-level Information on Generalizable AI-Generated Image Detection

Ziyin Zhou1 Ke Sun1 Zhongxi Chen1 Xianming Lin1
Yunpeng Luo2 Ke Yan2 Shouhong Ding2 Xiaoshuai Sun1
1 Key Laboratory of Multimedia Trusted Perception and Efficient Computing,
  Ministry of Education of China, Xiamen University 
2 Tencent YouTu Lab
Abstract

Existing state-of-the-art AI-generated image detection methods mostly extract low-level information, such as noise patterns, from RGB images to improve generalization. However, these methods often consider only a single type of low-level information, which may lead to suboptimal generalization. Through empirical analysis, we discover a key insight: different types of low-level information often generalize to different types of forgeries. Furthermore, we find that simple fusion strategies are insufficient to leverage the detection advantages of low-level and high-level information across forgery types. Therefore, we propose the Adaptive Low-level Experts Injection (ALEI) framework. Our approach introduces LoRA experts, enabling the backbone network, which is trained with high-level semantic RGB images, to accept and learn knowledge from different low-level information. We utilize a cross-attention method to adaptively fuse these features at intermediate layers. To prevent the backbone network from losing the modeling capabilities of different low-level features during the later stages of modeling, we develop a Low-level Information Adapter that interacts with the features extracted by the backbone network. Finally, we propose Dynamic Feature Selection, which dynamically selects the most suitable features for detecting the current image to maximize generalization. Extensive experiments demonstrate that our method, finetuned on only four categories of mainstream ProGAN data, performs excellently and achieves state-of-the-art results on multiple datasets containing unseen GAN and Diffusion methods.

1 Introduction

Advanced AIGC technologies, such as GANs [13, 21, 22, 23] and Diffusion models [8, 14, 35, 42], have seen significant progress, raising concerns about misuse, privacy, and copyright issues. To address these concerns, universal AI-generated image detection methods are essential. A major challenge faced by existing detection methods is how to effectively generalize to unseen AI-generated images in real-world scenarios. Existing methods [36, 54], which primarily use RGB images, often focus on content information, leading to overfitting on the AI-generated fake images in the training set and a significant drop in accuracy on unseen AI-generated images.

Recent studies have demonstrated that incorporating low-level information, which refers to fundamental signal properties like noise patterns and subtle artifacts inherent in images [58, 59], can significantly enhance the generalization of detection models [49, 18, 19, 55, 26, 48]. For example, LNP [26] and NPR [48] achieve state-of-the-art results by leveraging low-level information. LNP extracts noise patterns from spatial images using a well-trained denoising model, while NPR focuses on artifacts from upsampling operations in generative models. These methods study and design specific types of low-level information for detection. However, the diversity of AIGC technologies and the variety of low-level features raise two unresolved but important questions: 1. How do different types of low-level information contribute to the detection of various AIGC forgeries? 2. Is simply incorporating low-level features into existing models sufficient for optimal detection results?

Figure 1: Radar chart of the average accuracy on various forgery test datasets using (a) different low-level features and (b) different fusion strategies.

To address these questions, we conduct two sets of analytical experiments. First, we train detection models using 6 widely used low-level features and evaluate their performance separately on 16 distinct AI-generation methods. We then explore the impact of combining multiple low-level information sources by examining both early and late fusion strategies on these images. The results of these experiments are presented in Fig. 1. Our analysis of these validation experiments yields two key insights: (a) The effectiveness of different low-level information varies significantly across various types of AIGC image forgeries. (b) Simple fusion mechanisms prove inadequate in fully leveraging these low-level features for optimal detection performance. The detailed analysis of these two insights is presented in Sec. 3, with comprehensive experimental results provided in the Appendix. Thus, it is important to design methods that integrate low-level features into detection models effectively.

In this paper, we propose the Adaptive Low-level Experts Injection (ALEI) framework, which adaptively incorporates diverse low-level information into the visual backbone to effectively detect a wide range of AI-Generated image forgeries. Specifically, we train an expert for each type of low-level information using LoRA [17] and develop a cross-attention layer to facilitate feature fusion. To address the potential loss of low-level features during deep transformer modeling, we introduce a low-level information adapter. This adapter extracts low-level features through two convolutional layers and maintains ongoing interaction with the backbone’s features via our custom-designed injector and extractor. For the classification, we implement dynamic feature selection, which can adaptively choose the relevant features beneficial for detecting the unseen AI-generated images.

The main contributions can be summarized as follows:

  • We offer key insights into the effectiveness of various low-level features for detecting AI-generated images. Our findings demonstrate that different types of low-level information generalize differently across various AIGC forgery types, and simple fusion strategies are inadequate for achieving optimal detection performance.

  • We propose the Adaptive Low-level Experts Injection (ALEI) framework, a novel approach that adaptively integrates diverse low-level information into the visual backbone. Our framework adds a LoRA expert for each type of low-level information and employs a cross-attention layer for fusion at intermediate layers. The Low-level Information Adapter maintains and effectively fuses low-level features during the forward pass of the visual backbone, while dynamic feature selection chooses the appropriate detection features for the current AIGC forgery types.

  • Experimental results demonstrate that our method achieves competitive performance with state-of-the-art methods across multiple AI-generated image detection benchmark datasets.

2 Related Work

2.1 AI-Generated Image Detection

AI-generated image detection methods can be broadly categorized into high-level-information based and low-level-information based methods.

High-level Based Methods. Considering that images can provide semantic high-level features through subsequent modeling by convolutional networks [28], we refer to RGB images as high-level information. Early research uses images as input and trains binary classification models for GAN-generated image detection. For instance, Wang et al. [54] uses ProGAN images and real images as the training set, achieving promising results across multiple GAN methods. Rossler et al. [43] trains an Xception model to identify deepfake facial images, while Chai et al. [3] focuses on detecting recognizable regions within images. More recently, Ojha et al. [36] achieves good generalization to diffusion models by finetuning the fully connected layers of CLIP’s ViT-L backbone. Building upon this approach, Liu et al. [27] further enhances generalization by considering CLIP’s text encoding embeddings and introducing frequency-related adapters into the image encoder.

Low-level Based Methods. Following the descriptions in prior works [51, 29], we refer to the noise patterns extracted from RGB images as low-level information. Since directly using high-level RGB images as the training set [54] often results in limited generalization to unseen AI-generated images, some studies attempt to find universal low-level forgery information based on high-level images [32, 26, 18, 61, 49, 55, 48]. Luo et al. [32] utilizes SRM filters [12] to extract high-frequency features, enhancing the generalization of face forgery detection. Jeong et al. [18] amplifies artifacts using high-frequency filters to achieve better detection performance. Liu et al. [26] extracts noise from images using a denoising network, and Zhong et al. [61] uses this noise as a detection baseline. Tan et al. [49] uses gradient maps generated from a discriminator pretrained on StyleGAN for detection. Zhong et al. [61] trains models based on arrangements of high-frequency features extracted by SRM filters in both adversarial and benign texture regions. Wang et al. [55] utilizes an ADM model for image reconstruction and uses the difference between the reconstructed and original images (DIRE) for classification. Tan et al. [48] proposes NPR as the low-level information of the upsampling process for detection, achieving impressive generalization across multiple forgery types.

2.2 Low-Level Fusion in Detection Tasks

Low-level information plays a crucial role in tasks that are difficult for the human eye to perceive. Therefore, many studies explore how incorporating low-level information as input can enhance the performance of methods that use only high-level information. Wang et al. [53] guides the detection model to detect camouflaged objects by incorporating depth maps into the detection network based on RGB images. Guillaro et al. [15] trains a noise network called Noiseprint using a contrastive learning loss to detect image manipulation traces, and then integrates the traces and images into a transformer network for classification and segmentation. Triaridis et al. [51] employs multiple low-level features for adaptive early fusion in the input module of the transformer, achieving state-of-the-art results on multiple datasets. Liu et al. [29] develops a universal framework for detecting various low-level structures. Other works [32, 47] introduce high-frequency features extracted by SRM [12] into the high-level detection branch through designed fusion modules for deepfake detection. The method in [34] combines RGB and frequency domain information using a two-stream network to detect processed face images and videos. However, in AI-generated image detection, although many methods use low-level information instead of high-level images for generalization, detection methods that combine multiple types of low-level information with high-level information remain unexplored.

3 Analysis of Low-level Information

To further investigate the phenomena highlighted in the introduction, we conducted two sets of experiments to analyze the effectiveness of various low-level features and their fusion strategies in AI-Generated image detection.

3.1 Evaluation of Individual Low-level Features

Experimental Setup: We investigate 6 types of low-level information from various domains: SRM [12], DnCNN [6], NPR [48], LNP [26], Bayar [1], and NoisePrint [15]. Following the standard paradigm in AI-Generated image detection [36, 54], we train our model on a dataset comprising only ProGAN and real images, and subsequently test on other AI-Generated images using the AIGCDetectBenchmark [61], which includes 16 AI-Generated methods. We employ the visual backbone of CLIP [40] as the backbone, applying LoRA [17] to train the QKV matrix weights in the attention layers. The classification head is optimized using binary cross-entropy loss.

Results and Analysis: The detailed results are presented in Fig. 1 (a) and in the Appendix. NPR, DnCNN, and NoisePrint demonstrate strong generalization in detecting unseen AI-generated images. Image-based methods achieve superior performance on GAN datasets but show limitations on Diffusion-based datasets. Different types of low-level information vary in their generalization across AIGC methods: NPR excels in detecting mainstream GAN methods, particularly StyleGAN, while DnCNN and NoisePrint perform better on diffusion-based methods. Specifically, DnCNN excels at detecting DDPM-based generation methods such as ADM and Glide, whereas NoisePrint demonstrates sensitivity to LDM-based methods, such as Stable Diffusion.

Conclusion: Experiments lead us to the following insight: The effectiveness of different low-level information varies significantly across various types of AIGC image forgeries.

3.2 Evaluation of Simple Fusion Strategies

Experimental Setup: To explore the potential of combining multiple low-level information types, we use two simple fusion strategies: (1) Early Fusion: After embedding each input using learnable $1\times 1$ convolutional layers in the early stages of the backbone, a simple addition operation fuses the inputs. (2) Late Fusion: After extracting features for each input with the backbone, we concatenate the feature vectors and use a learnable classification head for training. Both of the above backbones are trained using LoRA [17].
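
The two baselines can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the helper names (`conv1x1`, `early_fusion`, `late_fusion`) and the toy list-based tensors are assumptions made for illustration, and the backbone itself is left out.

```python
# Illustrative sketch only: toy shapes and hypothetical helper names.

def conv1x1(pixel, weight):
    """Apply a 1x1 convolution to one pixel: a linear map over its channels."""
    return [sum(w * c for w, c in zip(row, pixel)) for row in weight]

def early_fusion(inputs, weights):
    """Embed each input with its own 1x1 convolution, then add element-wise."""
    height, width = len(inputs[0]), len(inputs[0][0])
    out_channels = len(weights[0])
    fused = [[[0.0] * out_channels for _ in range(width)] for _ in range(height)]
    for image, weight in zip(inputs, weights):
        for y in range(height):
            for x in range(width):
                embedded = conv1x1(image[y][x], weight)
                fused[y][x] = [a + b for a, b in zip(fused[y][x], embedded)]
    return fused

def late_fusion(feature_vectors):
    """Concatenate per-input backbone features for one shared classification head."""
    return [value for vector in feature_vectors for value in vector]
```

In the early variant the inputs are already summed before the backbone processes them, so the backbone never sees the individual signals. In the late variant the signals stay separate until the classification head, so they never interact inside the backbone.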

Results and Analysis: The results are presented in Fig. 1(b) and in the Appendix. Early Fusion appeared to confuse some key features, leading to a loss of generalization. Late Fusion, while showing strong results, still suffered from insufficient utilization, failing to match the generalization of individual low-level information types for certain AI-generated images.

Conclusion: Simple fusion methods prove inadequate in fully leveraging these low-level features for optimal detection performance across various AIGC forgery types.

Based on these findings, we propose the Adaptive Low-level Experts Injection (ALEI) framework. Given the superior performance of NPR, DnCNN, and NoisePrint, and following the principle of Occam’s Razor, we conduct further experiments using only these three types of low-level information. The integration of additional low-level information is also feasible within our framework, which we discuss further in the Appendix. The method is presented in the following sections.

Figure 2: The overall framework of our proposed method. Our method consists of three main components: (a) Cross-Low-level LoRA Transformer Layer, (b) Low-Level Information Interaction Adapter, and (c) Dynamic Feature Selection. These modules will be explained in the methods section.

4 Methodology

4.1 Overview

Given an input image $I\in\mathbb{R}^{H\times W\times 3}$, where $H$ and $W$ denote the height and width respectively, we extract multiple types of low-level information $C=\{C_{1},C_{2},...,C_{M}\}$, where each $C_{i}\in\mathbb{R}^{H\times W\times 3}$, $i=1,2,...,M$. Following UniFD [36], our approach uses CLIP’s visual backbone (ViT-L/14), as shown in Fig. 2. To enable the model, pretrained on high-level images, to accept various low-level information inputs and ensure effective integration, while avoiding insufficient fusion in either the early or late stages, we transform the original transformer layer into a Cross-Low-level Expert LoRA Transformer Layer, which is introduced in Sec. 4.2. Furthermore, to prevent the loss of low-level input characteristics in deep transformer modeling, we employ a low-level information interaction adapter. The adapter further injects low-level information into the ViT for enhanced interaction, as discussed in Sec. 4.3. Finally, to select suitable features for different types of forgeries, we propose the Dynamic Feature Selection method, which chooses the most appropriate low-level features for the current type of forgery and is detailed in Sec. 4.4. The overall training phase of our framework is presented in Sec. 4.5.

4.2 Cross-Low-level Transformer Layer

In our approach, we avoid merging features using straightforward fusion techniques. Instead, we strive to preserve the unique characteristics of each type of low-level information while capturing the interactions and influences between them. We denote the high-level image input $I$ as $C_{0}$ and add it to the set $C$, giving $M+1$ inputs $C_{j}$ $(j=0,1,2,...,M)$. The visual encoder first transforms each input tensor in $\mathbb{R}^{H\times W\times 3}$ into $D$-dimensional image features $F_{0}^{(j)}\in\mathbb{R}^{(1+L)\times D}$, where 1 represents the CLS token of the image and $L=\frac{H\times W}{P^{2}}$ is the number of patches, with $P$ denoting the patch size. The input features for the $j^{th}$ information $C_{j}$ at the $i^{th}$ transformer layer are denoted as $F_{i}^{(j)}\in\mathbb{R}^{(1+L)\times D}$, $i=0,1,2,...,N$, where $N$ denotes the number of layers in the transformer. The transformer module takes the patch-embedded features $F_{0}^{(j)}$ as input for each low-level information.
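
As a quick check of the shapes above, the token counts can be worked out directly, reading $P$ as the patch side length. The sketch below assumes a $224\times 224$ input, which is the usual resolution for CLIP ViT-L/14 but is not stated in this section, and takes $M=3$ from Sec. 3. The function name is hypothetical.

```python
def token_counts(height, width, patch_size, num_low_level):
    """Return (L, 1 + L, (1 + M) * (1 + L)) for an H x W image split into P x P patches."""
    if height % patch_size or width % patch_size:
        raise ValueError("height and width must be multiples of the patch size")
    patches = (height // patch_size) * (width // patch_size)  # L = H * W / P^2
    per_stream = 1 + patches                                  # CLS token plus L patch tokens
    all_streams = (1 + num_low_level) * per_stream            # RGB stream plus M low-level streams
    return patches, per_stream, all_streams

# Assumed setting: 224 x 224 input, patch size 14, M = 3 low-level inputs.
print(token_counts(224, 224, 14, 3))  # (256, 257, 1028)
```

Under these assumptions each stream carries 257 tokens, and concatenating the four streams gives a sequence of 1028 tokens.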

Considering the distinctiveness of each type of information, we aim to embed the knowledge of each into the CLIP visual backbone without affecting the original pretrained weights. We employ the fine-tuning technique known as LoRA [17], which is widely used in large language models and diffusion models, to incorporate modal knowledge through an additional plug-and-play module.

Each expert layer consists of our designed Multi-LoRA-Expert Layer in Fig. 2(a), Self-Attention, residual connections, Layer Normalization and an FFN layer. In the Multi-LoRA-Expert Layer at layer $i$, we employ LoRA to process features specific to each input by designing different LoRA experts. The computation is as follows:

$$\hat{F}_{i}^{(j)}=W_{qkv}\cdot F_{i}^{(j)}+\frac{\alpha}{r}\Delta W_{j}\cdot F_{i}^{(j)}=W_{qkv}\cdot F_{i}^{(j)}+\frac{\alpha}{r}B_{j}A_{j}\cdot F_{i}^{(j)}$$ (1)

Here, $\hat{F}_{i}^{(j)}$ represents the output of $F_{i}^{(j)}$ after processing by the $j^{th}$ LoRA expert, and we set $r=4$ and $\alpha=8$. $W_{qkv}$ denotes the weight matrix of the QKV projection in the attention layer, and $\Delta W_{j}=B_{j}A_{j}$ is the trainable parameter of the $j^{th}$ LoRA expert. Next, $\hat{F}_{i}^{(j)}$ serves as the input for the self-attention $Q,K,V$ in the original CLIP, and the output after the FFN layer is denoted as $\overline{F}_{i}^{(j)}$. Noting that the features of each type of information are computed in parallel without interaction, we employ a cross-attention layer in the original output section to facilitate interaction between modalities, as computed by:

$$\begin{split}\overline{F}_{i}&=\text{Concatenate}[\overline{F}_{i}^{(j)},\ 0\leq j\leq M]\\ F_{i+1}&=\overline{F}_{i}+\beta_{i}\,\text{MHA}(\text{LN}(\overline{F}_{i}),\text{LN}(\overline{F}_{i}),\text{LN}(\overline{F}_{i}))\end{split}$$ (2)

Here, LN(·) represents LayerNorm, and MHA(·) denotes multi-head attention with the number of heads set to 4. Furthermore, we apply a learnable vector $\beta_{i}\in\mathbb{R}^{(1+L)\times D}$, initialized to 0, to balance the output of the attention layer with the input features. This initialization strategy ensures that the unique features of each modality do not undergo drastic changes due to the injection of features from other modalities, while allowing the model to adaptively integrate forgery-related features contained in other modalities.
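
Eqs. (1) and (2) can be sketched in a few lines of pure Python. This is a simplified illustration under several assumptions, not the authors' code:

- Attention is single-head and has no projection matrices, whereas the paper uses 4 heads.
- LayerNorm has no learnable parameters.
- The self-attention and FFN that sit between Eq. (1) and Eq. (2) are skipped.
- $\beta_{i}$ is shared across the $M+1$ streams so that its stated shape matches the concatenated sequence.
- All function names are hypothetical.

```python
import math

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def lora_project(token, w_frozen, lora_a, lora_b, alpha=8.0, r=4):
    """Eq. (1) for one token of stream j: W_qkv x + (alpha / r) * B_j (A_j x)."""
    base = matvec(w_frozen, token)                 # frozen pretrained projection
    delta = matvec(lora_b, matvec(lora_a, token))  # rank-r update owned by expert j
    return [b + (alpha / r) * d for b, d in zip(base, delta)]

def layer_norm(vector, eps=1e-5):
    """LayerNorm without learnable scale and shift (omitted for brevity)."""
    mean = sum(vector) / len(vector)
    var = sum((v - mean) ** 2 for v in vector) / len(vector)
    return [(v - mean) / math.sqrt(var + eps) for v in vector]

def attention(queries, keys, values):
    """Single-head scaled dot-product attention without projection matrices."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        outputs.append([sum(w / total * v[d] for w, v in zip(weights, values))
                        for d in range(len(values[0]))])
    return outputs

def fuse_streams(streams, beta):
    """Eq. (2): concatenate all streams, then add the beta-gated attention output."""
    tokens = [token for stream in streams for token in stream]
    normed = [layer_norm(token) for token in tokens]
    attended = attention(normed, normed, normed)
    per_stream = len(streams[0])
    fused = []
    for index, (token, update) in enumerate(zip(tokens, attended)):
        gate = beta[index % per_stream]  # assumption: beta is shared across streams
        fused.append([t + g * u for t, g, u in zip(token, gate, update)])
    return fused
```

With `beta` set to all zeros, `fuse_streams` returns the concatenated tokens unchanged, which is the behaviour the zero initialization is meant to give at the start of training. Similarly, if $B_{j}$ starts at zero as in the standard LoRA recipe (whether the authors follow it is not stated here), `lora_project` initially reproduces the frozen projection.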

4.3 Low-level Information Interaction Adapter

Many works [60, 38, 57] suggest that the deeper layers of transformers might lose low-level information and focus instead on learning semantic information. Inspired by [4], to prevent our framework from losing critical classification features related to forgery types during the fusion of low-level information, we introduce a low-level information interaction adapter. This adapter is designed to capture low-level information priors and to enhance the significance of low-level information within the backbone. It operates parallel to the patch embedding layer of the CLIP image encoder and does not alter the architecture of the CLIP visual backbone. Unlike the ViT-Adapter [4], which injects spatial priors, our adapter injects low-level priors.

As illustrated in Fig. 2(b), we utilize the first two blocks of ResNet50 [16], followed by global pooling and several $1\times 1$ convolutions applied at the end to project the low-level information $C_{1},C_{2},...,C_{M}$ into $D$ dimensions. Through this process, we obtain the feature vector $G_{0}\in\mathbb{R}^{D}$ extracted from the low-level encoder. To better integrate our features into the backbone, we design a cross-attention-based low-level feature injector and a low-level feature extractor.

Low-level Feature Injector. This module is used to inject low-level priors into the ViT. As shown in Fig. 2(b), the outputs of all modality features at the $i^{th}$ layer of CLIP ViT-L are concatenated into a feature vector $F_{i}\in\mathbb{R}^{(1+M)\cdot(1+L)\times D}$, which serves as the query for computing cross-attention. The low-level feature $G_{i}$ acts as the key and value and is injected into the modal feature $F_{i}$, as represented by the following equation:

$$\tilde{F}_{i}=F_{i}+\gamma_{i}\,\text{MHA}(\text{LN}(F_{i}),\text{LN}(G_{i}),\text{LN}(G_{i}))$$ (3)

As before, LN and MHA respectively represent LayerNorm and multi-head attention, with the number of heads set to 4. Similarly, we use a learnable vector $\gamma_{i}\in\mathbb{R}^{D}$ to balance the two different features.
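
A minimal sketch of the injector in Eq. (3) follows. It makes the same simplifying assumptions as a toy model would: single-head attention without projection matrices, LayerNorm without learnable parameters, and hypothetical function names. $G_{i}\in\mathbb{R}^{D}$ is treated as a key/value sequence of length one.

```python
import math

def layer_norm(vector, eps=1e-5):
    """LayerNorm without learnable scale and shift (omitted for brevity)."""
    mean = sum(vector) / len(vector)
    var = sum((v - mean) ** 2 for v in vector) / len(vector)
    return [(v - mean) / math.sqrt(var + eps) for v in vector]

def attention(queries, keys, values):
    """Single-head scaled dot-product attention without projection matrices."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        outputs.append([sum(w / total * v[d] for w, v in zip(weights, values))
                        for d in range(len(values[0]))])
    return outputs

def inject(tokens, low_level, gamma):
    """Eq. (3): backbone tokens F_i query the low-level feature G_i."""
    queries = [layer_norm(token) for token in tokens]
    memory = [layer_norm(low_level)]  # G_i acts as a length-1 key/value sequence
    updates = attention(queries, memory, memory)
    return [[t + g * u for t, g, u in zip(token, gamma, update)]
            for token, update in zip(tokens, updates)]
```

Because there is only one key, the softmax weight is always 1, so in this simplified form every backbone token receives the same normalized low-level vector scaled by $\gamma_{i}$. In a full multi-head module the learned value and output projections would transform that vector first.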

Modal Feature Extractor. After injecting the low-level priors into the backbone, we perform the forward propagation process. We concatenate the output of each modality feature of the $(i+1)^{th}$ layer to obtain the feature vector $F_{i+1}$ and then apply a module composed of cross-attention and FFN to extract modal features, as shown in Fig. 2(b). This process is represented by the following equations:

$$\tilde{G}_{i}=G_{i}+\eta_{i}\,\text{MHA}(\text{LN}(G_{i}),\text{LN}(F_{i+1}),\text{LN}(F_{i+1}))$$ (4)
$$G_{i+1}=\tilde{G}_{i}+\text{FFN}(\text{LN}(\tilde{G}_{i}))$$ (5)

Here, the low-level feature $G_{i}\in\mathbb{R}^{D}$ serves as the query, and the output $F_{i+1}\in\mathbb{R}^{(1+M)\cdot(1+L)\times D}$ from the backbone acts as the key and value. Similar to the low-level feature injector, we use a learnable vector $\eta_{i}\in\mathbb{R}^{D}$ to balance the two different features. $G_{i+1}$ is then used as the input for the next low-level feature injector.
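
The extractor step in Eqs. (4) and (5) can be sketched in the same simplified style. The assumptions are single-head attention without projection matrices and parameter-free LayerNorm. The FFN is passed in as a caller-supplied function because its architecture is not specified in this section, and all names are hypothetical.

```python
import math

def layer_norm(vector, eps=1e-5):
    """LayerNorm without learnable scale and shift (omitted for brevity)."""
    mean = sum(vector) / len(vector)
    var = sum((v - mean) ** 2 for v in vector) / len(vector)
    return [(v - mean) / math.sqrt(var + eps) for v in vector]

def attention(queries, keys, values):
    """Single-head scaled dot-product attention without projection matrices."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        outputs.append([sum(w / total * v[d] for w, v in zip(weights, values))
                        for d in range(len(values[0]))])
    return outputs

def extract(low_level, tokens, eta, ffn):
    """Eqs. (4)-(5): G_i queries the backbone tokens F_{i+1}, then a residual FFN is applied."""
    query = [layer_norm(low_level)]
    memory = [layer_norm(token) for token in tokens]
    update = attention(query, memory, memory)[0]
    g_tilde = [g + e * u for g, e, u in zip(low_level, eta, update)]   # Eq. (4)
    return [g + f for g, f in zip(g_tilde, ffn(layer_norm(g_tilde)))]  # Eq. (5)
```

Alternating `extract` with the injector across layers keeps a running low-level summary $G_{i}$ that is refreshed from the backbone and fed back into it at the next layer.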

4.4 Dynamic Feature Selection

As mentioned in the introduction, since different features are often sensitive to different types of forgeries, simple feature concatenation or averaging followed by training with a unified classification head might lose the advantages of some features for detecting certain types of forgeries. To better integrate low-level features for generalizing to various forgery types, and inspired by the dynamic routing used in mixture-of-experts models [45], we introduce a dynamic modal feature selection mechanism at the final classification stage of the model. Specifically, we extract the CLS tokens of the final output, concatenate them, and denote the result as $F_{cls}\in\mathbb{R}^{(1+M)\cdot D}$, which serves as the input for the dynamic router. The dynamic router employs a learnable fully connected layer, with its matrix parameter defined as $W_{Router}\in\mathbb{R}^{(1+M)\cdot D\times(1+M)}$. The probability distribution for selecting each modal feature is computed as follows:

$$p=\text{SoftMax}(W_{Router}F_{cls}) \tag{6}$$

For each feature, a corresponding classification head $\text{head}_{i}$, $i=0,1,2,\ldots,M$, is prepared. The final classification result $\hat{y}$ is obtained through the following equation:

$$\hat{P}(y)=\sum_{i=0}^{M}p_{i}\cdot\text{head}_{i}(F_{cls}^{i}) \tag{7}$$

Here, $F_{cls}^{i}$ represents the cls token of the $i^{th}$ output feature. By adaptively learning a dynamic modal feature selection module, we enable the selection of suitable features for integration, thus allowing the classification to be tailored to the forgery type of the current image under detection. To balance the selection of different experts, we use an entropy regularization loss as an additional constraint, as shown below:

$$\mathcal{L}_{moe}=-\sum_{i=0}^{M}p_{i}\log p_{i} \tag{8}$$
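Eqs. 6 to 8 can be illustrated with a small, dependency-free sketch. It is not the authors' code: the function names `route`, `fuse`, and `entropy` are hypothetical, each head output is assumed to be a single probability, and the router matrix is stored as $1+M$ rows of length $(1+M)\cdot D$, an assumed orientation chosen so that the product yields one logit per feature.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route(f_cls, w_router):
    """Eq. 6: selection probabilities p over the 1+M features.

    f_cls    : concatenated cls tokens, a flat list of (1+M)*D floats
    w_router : list of 1+M rows, each of length (1+M)*D (assumed orientation)
    """
    logits = [sum(w * x for w, x in zip(row, f_cls)) for row in w_router]
    return softmax(logits)


def fuse(p, head_outputs):
    """Eq. 7: probability-weighted sum of the per-feature head outputs."""
    return sum(pi * hi for pi, hi in zip(p, head_outputs))


def entropy(p, eps=1e-12):
    """Eq. 8: entropy of the routing distribution (eps avoids log(0))."""
    return -sum(pi * math.log(pi + eps) for pi in p)
```

For example, a router with all-zero weights assigns equal probability to every feature, so the fused prediction is the plain average of the head outputs and the entropy term takes its maximum value $\log(1+M)$.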
Generator CNNDet GramNet LNP LGrad DIRE-G DIRE-D UnivFD PatchCraft Ours
ProGAN 100.00 99.99 99.95 99.83 95.19 52.75 99.81 100.00 100.00
StyleGAN 90.17 87.05 92.64 91.08 83.03 51.31 84.93 92.77 98.35
BigGAN 71.17 67.33 88.43 85.62 70.12 49.70 95.08 95.80 94.51
CycleGAN 87.62 86.07 79.07 86.94 74.19 49.58 98.33 70.17 97.03
StarGAN 94.60 95.05 100.00 99.27 95.47 46.72 95.75 99.97 100.00
GauGAN 81.42 69.35 79.17 78.46 67.79 51.23 99.47 71.58 95.19
StyleGAN2 86.91 87.28 93.82 85.32 75.31 51.72 74.96 89.55 98.88
whichfaceisreal 91.65 86.80 50.00 55.70 58.05 53.30 86.90 85.80 75.71
ADM 60.39 58.61 83.91 67.15 75.78 98.25 66.87 82.17 88.43
Glide 58.07 54.50 83.50 66.11 71.75 92.42 62.46 83.79 91.53
Midjourney 51.39 50.02 69.55 65.35 58.01 89.45 56.13 90.12 91.56
SDv1.4 50.57 51.70 89.33 63.02 49.74 91.24 63.66 95.38 93.28
SDv1.5 50.53 52.16 88.81 63.67 49.83 91.63 63.49 95.30 93.38
VQDM 56.46 52.86 85.03 72.99 53.68 91.90 85.31 88.91 90.94
wukong 51.03 50.76 86.39 59.55 54.46 90.90 70.93 91.07 89.46
DALLE2 50.45 49.25 92.45 65.45 66.48 92.45 50.75 96.60 93.32
Average 69.73 68.43 85.28 75.11 67.90 72.70 76.80 89.85 93.29
Table 1: The detection accuracy comparison between our approach and baselines. Among all detectors, the best result and the second-best result are denoted in boldface and underlined, respectively. The complete table will be presented in the Appendix.

4.5 Training phase

We first train the Lora Expert and the low-level information encoder for each type of low-level information and for the high-level image information, so that the model learns knowledge relevant to AI-Generated image detection from both low-level and high-level information. Let the true label be $y$ and the model’s prediction be $\hat{P}(y)$. This training is performed using the cross-entropy loss defined in Eq. 9. Subsequently, we load these pre-trained weights into our framework and further train our fusion module to ensure adequate and appropriate fusion of each type of low-level and high-level information. Our final fused prediction is given in Eq. 7, and we optimize the overall framework using Eq. 10, in which the loss is the classification loss (Eq. 9) plus the expert balance regularization loss (Eq. 8) weighted by $\lambda$. In our experiments, we set $\lambda=0.1$.

$$\mathcal{L}_{cls}=-y\cdot\log\hat{P}(y)-(1-y)\cdot\log(1-\hat{P}(y)) \tag{9}$$
$$\mathcal{L}_{total}=\mathcal{L}_{cls}+\lambda\mathcal{L}_{moe} \tag{10}$$
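As a worked illustration of Eqs. 9 and 10, the sketch below evaluates the combined objective for a single sample. The function names and the clipping constant `eps` are assumptions added for numerical safety; only the formulas and $\lambda=0.1$ come from the text.

```python
import math


def cls_loss(y, p_hat, eps=1e-7):
    """Eq. 9: binary cross-entropy for label y in {0, 1} and prediction p_hat."""
    p = min(max(p_hat, eps), 1.0 - eps)  # clip to keep log() finite (assumption)
    return -y * math.log(p) - (1 - y) * math.log(1.0 - p)


def moe_loss(router_probs):
    """Eq. 8: entropy of the routing distribution; zero entries contribute 0."""
    return -sum(p * math.log(p) for p in router_probs if p > 0)


def total_loss(y, p_hat, router_probs, lam=0.1):
    """Eq. 10: classification loss plus lambda-weighted regularization term."""
    return cls_loss(y, p_hat) + lam * moe_loss(router_probs)
```

For a fake image ($y=1$) predicted at $\hat{P}(y)=0.5$ with a uniform two-way routing distribution, both terms equal $\log 2$, so the total is $1.1\log 2$.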
Method AttGAN BEGAN CramerGAN InfoMaxGAN MMDGAN RelGAN SNGAN Mean
Acc. A.P. Acc. A.P. Acc. A.P. Acc. A.P. Acc. A.P. Acc. A.P. Acc. A.P. Acc. A.P.
CNNDet 51.1 83.7 50.2 44.9 81.5 97.5 71.1 94.7 72.9 94.4 53.3 82.1 62.7 90.4 62.3 82.9
Frank 65.0 74.4 39.4 39.9 31.0 36.0 41.1 41.0 38.4 40.5 69.2 96.2 48.4 47.9 47.5 54.7
Durall 39.9 38.2 48.2 30.9 60.9 67.2 50.1 51.7 59.5 65.5 80.0 88.2 54.8 58.9 60.3 63.3
Patchfor 68.0 92.9 97.1 100.0 97.8 99.9 93.6 98.2 97.9 100.0 99.6 100.0 97.6 99.8 90.1 95.4
F3Net 85.2 94.8 87.1 97.5 89.5 99.8 67.1 83.1 73.7 99.6 98.8 100.0 51.6 93.6 75.4 93.1
SelfBlend 63.1 66.1 56.4 59.0 75.1 82.4 79.0 82.5 68.6 74.0 73.6 77.8 61.6 65.0 65.8 69.7
GANDet 57.4 75.1 67.9 100.0 67.8 99.7 67.6 92.4 67.7 99.3 60.9 86.2 66.7 90.6 66.1 91.6
LGrad 68.6 93.8 69.9 89.2 50.3 54.0 71.1 82.0 57.5 67.3 89.1 99.1 78.0 87.4 68.6 80.8
UnivFD 78.5 98.3 72.0 98.9 77.6 99.8 77.6 98.9 77.6 99.7 78.2 98.7 77.6 98.7 77.6 98.8
NPR 83.0 96.2 99.0 99.8 98.7 99.0 94.5 98.3 98.6 99.0 99.6 100.0 88.8 97.4 93.2 96.6
Ours 86.2 97.8 100.0 100.0 100.0 100.0 98.6 99.9 99.3 99.8 100.0 100.0 90.4 98.7 95.3 98.1
Table 2: Cross-GAN-Sources Evaluation on the GANGenDetection [50]. Partial results from [48]. The complete table will be presented in the Appendix.
LE LIIA CLA DFS Acc. A.P.
80.8 87.6
\checkmark 89.0 93.7
\checkmark \checkmark 91.7 96.0
\checkmark \checkmark 90.6 95.3
\checkmark \checkmark \checkmark 92.8 97.8
\checkmark \checkmark \checkmark \checkmark 93.3 98.4
Table 3: Performance of different combinations of model components.
Method DALLE Glide_100_10 Glide_50_27 ADM LDM_100 LDM_200 Mean
Acc. A.P. Acc. A.P. Acc. A.P. Acc. A.P. Acc. A.P. Acc. A.P. Acc. A.P.
CNNDet 51.8 61.3 53.3 72.9 54.2 76.0 54.9 66.6 51.9 63.7 52.0 64.5 52.8 67.4
Frank 57.0 62.5 53.6 44.3 52.0 42.3 53.4 52.5 56.6 51.3 56.4 50.9 54.5 49.6
Durall 55.9 58.0 54.9 52.3 51.7 49.9 40.6 42.3 62.0 62.6 61.7 61.7 54.3 54.0
Patchfor 79.8 99.1 87.3 99.7 84.9 98.8 74.2 81.4 95.8 99.8 95.6 99.9 86.8 97.2
F3Net 71.6 79.9 88.3 95.4 88.5 95.4 69.2 70.8 74.1 84.0 73.4 83.3 79.1 86.5
SelfBlend 52.4 51.6 58.8 63.2 64.2 68.3 58.3 63.4 53.0 54.0 52.6 51.9 56.3 58.7
GANDet 67.2 83.0 51.2 52.6 51.7 53.5 49.6 49.0 54.7 65.8 54.9 65.9 54.3 60.1
LGrad 88.5 97.3 89.4 94.9 90.7 95.1 86.6 100.0 94.8 99.2 94.2 99.1 90.9 97.2
UnivFD 89.5 96.8 90.1 97.0 91.1 97.4 75.7 85.1 90.5 97.0 90.2 97.1 86.9 94.5
NPR 94.5 99.5 98.2 99.8 98.2 99.8 75.8 81.0 99.3 99.9 99.1 99.9 95.2 97.4
FAFormer 98.8 99.8 94.2 99.2 94.7 99.4 76.1 92.0 98.7 99.9 98.6 99.8 93.8 95.5
Ours 97.7 99.7 97.9 99.2 98.6 99.9 90.1 96.4 99.5 99.9 98.9 99.3 97.3 99.1
Table 4: Cross-Diffusion-Sources Evaluation on the diffusion test set of UniversalFakeDetect [36]. Partial results from [27, 48]. The complete table will be presented in the Appendix.
Image NPR DnCNN NoisePrint Acc. A.P.
\checkmark 85.3 91.8
\checkmark 84.6 91.4
\checkmark 83.9 89.6
\checkmark 85.1 90.1
\checkmark \checkmark 89.1 93.2
\checkmark \checkmark \checkmark 91.3 95.1
\checkmark \checkmark \checkmark \checkmark 93.3 98.4
Table 5: Performance of different combinations of low-level information used in the main text.

5 Experiment

5.1 Experimental Setups

Training Dataset. To ensure a fair comparison, we adhere to the training set proposed by [54]. Testing is then conducted on other unseen forgery types, such as those generated by different GANs or new diffusion models. This training set comprises 20 different categories, with each category containing 18,000 synthetic images generated by ProGAN. Additionally, an equal number of real images sampled from the LSUN dataset are included. As in previous methods [18, 19, 48, 27], we restrict the training set to four categories: car, cat, chair, and horse.

Testing Dataset. To further evaluate the generalization capability of the proposed method in real-world scenarios, we employ various real-world images and images generated by diverse GANs and diffusion models. The evaluation follows the test datasets proposed by previous methods and primarily includes the following benchmarks: CNNDetectionBenchmark [54], GANGenDetectionBenchmark [50], UniversalFakeDetectBenchmark [36], and AIGCDetectBenchmark [61]. Although we achieve state-of-the-art (SOTA) results on the other benchmarks, our analysis in the main text and the subsequent ablation studies are conducted specifically on the AIGCDetectBenchmark. This benchmark incorporates the widely used GenImage [64] dataset for AI-Generated image detection, along with data from up to 16 different AI generation methods, allowing for a comprehensive evaluation. More details about the testing datasets are provided in the Appendix.

SOTA Methods Details. This paper aims to establish a framework that integrates multiple low-level and high-level features to enhance the generalization capabilities of AI-generated image detection. To this end, we conduct extensive comparisons with several state-of-the-art methods that explore generalization in AI-generated image detection, including: CNNDet [54], FreDect [10], Fusing [20], GramNet [31], Frank [11], Durall [9], Patchfor [3], F3Net [39], SelfBlend [46], GANDet [33], FrePGAN [19], BiHPF [18], LNP [26], LGrad [49], DIRE-G [55], DIRE-D [55], UnivFD [36], PatchCraft [61], FAFormer [27], and NPR [48]. In this context, DIRE-D refers to the results obtained using the pretrained weights of the original DIRE model, trained on the ADM dataset, while DIRE-G refers to the results obtained by retraining the DIRE model on the ProGAN dataset.

Implementation Details. Our main training and testing settings largely follow previous research. First, the input images are resized to $256\times 256$, then center-cropped to $224\times 224$. During training, we use a random cropping strategy, while for testing, only center cropping is applied. We train our method using the Adam optimizer with parameters (0.9, 0.999), a learning rate of $2\times 10^{-4}$, and a batch size of 32. Our method is implemented using the PyTorch framework on four Nvidia GeForce RTX 3090 GPUs. The training period is set to 10 epochs. The overall training task can be completed within 24 hours. We report the average accuracy (Acc.) and average precision (A.P.) during the evaluation for each forgery type. More details related to our method and baseline methods are provided in the Appendix.
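The preprocessing and optimizer settings above can be summarized in a short sketch. It only computes crop boxes and records hyperparameters; the names `center_crop_box`, `random_crop_box`, and `TRAIN_CONFIG` are hypothetical, and the `(left, upper, right, lower)` box layout is an assumed convention (it matches what Pillow's `Image.crop` expects), not something the paper specifies.

```python
import random

RESIZE, CROP = 256, 224  # resize target and crop size stated in the text


def center_crop_box(size=RESIZE, crop=CROP):
    """Test-time crop: a centered crop x crop window inside a size x size image."""
    off = (size - crop) // 2
    return (off, off, off + crop, off + crop)


def random_crop_box(size=RESIZE, crop=CROP, rng=random):
    """Training-time crop: a uniformly placed crop x crop window."""
    left = rng.randint(0, size - crop)  # randint is inclusive on both ends
    top = rng.randint(0, size - crop)
    return (left, top, left + crop, top + crop)


# Hyperparameters reported in the text, collected in one place.
TRAIN_CONFIG = {
    "optimizer": "Adam",
    "betas": (0.9, 0.999),
    "lr": 2e-4,
    "batch_size": 32,
    "epochs": 10,
}
```

With these sizes the center crop removes a 16-pixel border on every side, while the random crop can shift the window by up to 32 pixels horizontally and vertically.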

5.2 Compared with SOTA methods

Comparisons on AIGCDetectBenchmark. Tab. 1 reports the results of our method and the baseline methods on AIGCDetectBenchmark. Our method outperforms the previous state-of-the-art method by $3.44\%$ in average accuracy across 16 different forgery datasets. This notable achievement is largely due to the generalization capability offered by diverse low-level features for AI-generated image detection, along with the effective integration of low-level information containing various forensic clues. This enables our method to generalize well to unseen fake images using a limited amount of ProGAN data. Furthermore, we analyzed the time efficiency and overall parameters of our method. The results, presented in Tab. LABEL:tab_test_1, demonstrate that we achieve a balance between efficiency and accuracy at the same parameter level of the backbone when compared to SOTA methods.

Comparison on GANGenDetectionBenchmark. Tab. 2 evaluates the Acc. and A.P. metrics on GANGenDetection, with test results on CNNDetection provided in the Appendix. The test datasets were unseen during training, with ProGAN in the test set comprising 20 classes, compared to only 4 in the training set. Our method outperforms several baseline methods; compared with the state-of-the-art method NPR [48], it improves mean Acc. by $2.1\%$ and mean A.P. by $1.5\%$. This indicates that our method, by incorporating multiple types of low-level information, enhances detection performance uniformly across various GAN generation methods.

Comparison on UniversalFakeDetectBenchmark. Tab. 4 evaluates the Acc. and A.P. metrics on the diffusion test set of UniversalFakeDetect. Given that our method is trained on ProGAN, this setting poses a challenge, as the fake images originate from different diffusion methods, which differ significantly from GAN generation processes. Nevertheless, our method exhibits strong generalization capabilities across various diffusion models. Compared to the state-of-the-art methods NPR [48] and FAFormer [27], our method enhances Acc. by $2.0\%$ and $3.4\%$, respectively, and A.P. by $1.7\%$ and $3.6\%$, respectively. These results strongly suggest that the low-level information utilized contains critical clues that generalize well to diffusion detection, resulting in improved performance.
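For readers less familiar with the two metrics used in these comparisons, the following minimal sketch shows one common way to compute them. It assumes that label 1 denotes a fake image, that Acc. uses a 0.5 decision threshold, and that A.P. is the mean of the precision values at the rank of each positive sample with ties ignored; the paper does not state these details, so they are assumptions.

```python
def accuracy(labels, scores, threshold=0.5):
    """Fraction of samples whose thresholded score matches the label."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def average_precision(labels, scores):
    """Mean of precision@k taken at each rank k where a positive sample appears."""
    # Rank samples from the highest score to the lowest.
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits, precisions = 0, []
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / hits if hits else 0.0
```

Because A.P. depends only on the ranking of scores, a detector can reach a high A.P. while its Acc. at a fixed threshold stays low, which is one way to read rows in Tab. 2 such as GANDet on BEGAN (Acc. 67.9, A.P. 100.0).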

Figure 3: T-SNE visualization of features extracted by the classifier [52]. Blue and red represent the features of real images and fake images, respectively. The rightmost column shows the distribution bar chart of the selected different features when facing different forgery types.

5.3 Ablation Study

Combination of different low-level information. To demonstrate the effectiveness of the low-level information used in our method, we compare the performance obtained with different low-level information in Tab. 5. Each type of low-level information individually achieves over $83\%$ Acc. and $89\%$ A.P. on the test set, indicating generalization performance on synthetic images. As we progressively add low-level information, performance improves, with an overall enhancement of $8.0\%$ in Acc. and $6.6\%$ in A.P. In Fig. 3, we visualize the features of different low-level information using t-SNE [52] plots for various synthetic image methods (StyleGAN, BigGAN, ADM, Stable Diffusion), together with the distribution of selected low-level features for different forgery types. As noted in the Analysis section, different low-level information provides key clues for detecting different synthetic image methods, establishing distinct boundaries. For example, Image and NPR effectively separate BigGAN and StyleGAN, while DnCNN and NoisePrint delineate boundaries for ADM and Stable Diffusion. Our method adeptly selects the best features for classifying the current forgery type.

Figure 4: Visualization of the Class Activation Map (CAM) corresponding to different forgery types and different low-level information [62]. Warmer colors indicate higher probabilities.

Core model components. Tab. 3 presents the ablation study of our proposed model components: Lora Expert (LE), Cross-Low-level Attention (CLA), Low-level Information Interaction Adapter (LIIA), and Dynamic Feature Selection (DFS). Utilizing individual components and various combinations enhances the model’s generalization performance on the test set. By employing all components, our method achieves improvements of 12.5% in Acc. and 10.8% in A.P. compared to using only low-level information and Image as input, followed by late fusion and fine-tuning the fully connected layer. To further illustrate the effectiveness of our fusion strategy, we visualize the Class Activation Map (CAM) for images with different forgery types and low-level information using the CAM method from [62], shown in Fig. 4. The results indicate that different low-level information highlights distinct regions for the same forgery type, and our fusion method effectively combines these focus regions to better identify hidden forgery clues in the images.

6 Conclusion

In this paper, we have discovered the advantage of various low-level features in enhancing the generalization capability of AI-generated image detection. We present the Adaptive Low-level Experts Injection (ALEI) framework, which enhances AI-generated image detection through low-level features. By utilizing Lora Experts, our transformer-based approach learns from these features, merging them via a Cross-Low-level Attention layer. We introduce a Low-level Information Adapter to maintain the backbone’s modeling ability and employ Dynamic Feature Selection to optimize feature selection for current images. Our method achieves state-of-the-art results on multiple datasets, demonstrating improved generalization in detecting AI-generated images.

References

  • Bayar and Stamm [2016] Belhassen Bayar and Matthew C Stamm. A deep learning approach to universal image manipulation detection using a new convolutional layer. In Proceedings of the 4th ACM workshop on information hiding and multimedia security, pages 5–10, 2016.
  • Brock et al. [2018] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale gan training for high fidelity natural image synthesis. In International Conference on Learning Representations, 2018.
  • Chai et al. [2020] Lucy Chai, David Bau, Ser-Nam Lim, and Phillip Isola. What makes fake images detectable? understanding properties that generalize. In Proceedings of the European Conference on Computer Vision, pages 103–120. Springer, 2020.
  • Chen et al. [2022] Zhe Chen, Yuchen Duan, Wenhai Wang, Junjun He, Tong Lu, Jifeng Dai, and Yu Qiao. Vision transformer adapter for dense predictions. arXiv preprint arXiv:2205.08534, 2022.
  • Choi et al. [2018] Yunjey Choi, Minje Choi, Munyoung Kim, Jung-Woo Ha, Sunghun Kim, and Jaegul Choo. Stargan: Unified generative adversarial networks for multi-domain image-to-image translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8789–8797, 2018.
  • Corvi et al. [2023] Riccardo Corvi, Davide Cozzolino, Giovanni Poggi, Koki Nagano, and Luisa Verdoliva. Intriguing properties of synthetic images: from generative adversarial networks to diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 973–982, 2023.
  • Deng et al. [2009] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. IEEE, 2009.
  • Dhariwal and Nichol [2021] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. Advances in neural information processing systems, 34:8780–8794, 2021.
  • Durall et al. [2020] Ricard Durall et al. Watch your up-convolution: Cnn based generative deep neural networks are failing to reproduce spectral distributions. In CVPR, pages 7890–7899, 2020.
  • Frank et al. [2020a] Joel Frank, Thorsten Eisenhofer, Lea Schönherr, Asja Fischer, Dorothea Kolossa, and Thorsten Holz. Leveraging frequency analysis for deep fake image recognition. In International conference on machine learning, pages 3247–3258. PMLR, 2020a.
  • Frank et al. [2020b] Joel Frank et al. Leveraging frequency analysis for deep fake image recognition. In ICML, pages 3247–3258. PMLR, 2020b.
  • Fridrich and Kodovsky [2012] Jessica Fridrich and Jan Kodovsky. Rich models for steganalysis of digital images. IEEE Transactions on information Forensics and Security, 7(3):868–882, 2012.
  • Goodfellow et al. [2014] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. Advances in Neural Information Processing Systems, 27, 2014.
  • Gu et al. [2022] Shuyang Gu, Dong Chen, Jianmin Bao, Fang Wen, Bo Zhang, Dongdong Chen, Lu Yuan, and Baining Guo. Vector quantized diffusion model for text-to-image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10696–10706, 2022.
  • Guillaro et al. [2023] Fabrizio Guillaro, Davide Cozzolino, Avneesh Sud, Nicholas Dufour, and Luisa Verdoliva. Trufor: Leveraging all-round clues for trustworthy image forgery detection and localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20606–20615, 2023.
  • He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • Hu et al. [2021] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
  • Jeong et al. [2022a] Yonghyun Jeong, Doyeon Kim, Seungjai Min, Seongho Joe, Youngjune Gwon, and Jongwon Choi. Bihpf: Bilateral high-pass filters for robust deepfake detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 48–57, 2022a.
  • Jeong et al. [2022b] Yonghyun Jeong, Doyeon Kim, Youngmin Ro, and Jongwon Choi. Frepgan: robust deepfake detection using frequency-level perturbations. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 1060–1068, 2022b.
  • Ju et al. [2022] Yan Ju, Shan Jia, Lipeng Ke, Hongfei Xue, Koki Nagano, and Siwei Lyu. Fusing global and local features for generalized ai-synthesized image detection. In 2022 IEEE International Conference on Image Processing (ICIP), pages 3465–3469. IEEE, 2022.
  • Karras et al. [2018] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of gans for improved quality, stability, and variation. In International Conference on Learning Representations, 2018.
  • Karras et al. [2019] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4401–4410, 2019.
  • Karras et al. [2020] Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of stylegan. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8110–8119, 2020.
  • Kirillov et al. [2023] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4015–4026, 2023.
  • Lin et al. [2014] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer, 2014.
  • Liu et al. [2022] Bo Liu, Fan Yang, Xiuli Bi, Bin Xiao, Weisheng Li, and Xinbo Gao. Detecting generated images by real images. In European Conference on Computer Vision, pages 95–110. Springer, 2022.
  • Liu et al. [2023a] Huan Liu, Zichang Tan, Chuangchuang Tan, Yunchao Wei, Yao Zhao, and Jingdong Wang. Forgery-aware adaptive transformer for generalizable synthetic image detection. arXiv preprint arXiv:2312.16649, 2023a.
  • Liu et al. [2019] Wei Liu, Shengcai Liao, Weiqiang Ren, Weidong Hu, and Yinan Yu. High-level semantic feature detection: A new perspective for pedestrian detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5187–5196, 2019.
  • Liu et al. [2023b] Weihuang Liu, Xi Shen, Chi-Man Pun, and Xiaodong Cun. Explicit visual prompting for low-level structure segmentations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19434–19445, 2023b.
  • Liu et al. [2015] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of the IEEE international conference on computer vision, pages 3730–3738, 2015.
  • Liu et al. [2020] Zhengzhe Liu, Xiaojuan Qi, and Philip HS Torr. Global texture enhancement for fake face detection in the wild. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8060–8069, 2020.
  • Luo et al. [2021] Yuchen Luo, Yong Zhang, Junchi Yan, and Wei Liu. Generalizing face forgery detection with high-frequency features. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16317–16326, 2021.
  • Mandelli et al. [2022] Sara Mandelli, Nicolò Bonettini, Paolo Bestagini, and Stefano Tubaro. Detecting gan-generated images by orthogonal training of multiple cnns. In International Conference on Image Processing, pages 3091–3095. IEEE, 2022.
  • Masi et al. [2020] Iacopo Masi, Aditya Killekar, Royston Marian Mascarenhas, Shenoy Pratik Gurudatt, and Wael AbdAlmageed. Two-branch recurrent network for isolating deepfakes in videos. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part VII 16, pages 667–684. Springer, 2020.
  • Nichol et al. [2022] Alexander Quinn Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob Mcgrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. In International Conference on Machine Learning, pages 16784–16804. PMLR, 2022.
  • Ojha et al. [2023] Utkarsh Ojha, Yuheng Li, and Yong Jae Lee. Towards universal fake image detectors that generalize across generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24480–24489, 2023.
  • Park et al. [2019] Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu. Semantic image synthesis with spatially-adaptive normalization. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2337–2346, 2019.
  • Peng et al. [2021] Zhiliang Peng, Wei Huang, Shanzhi Gu, Lingxi Xie, Yaowei Wang, Jianbin Jiao, and Qixiang Ye. Conformer: Local features coupling global representations for visual recognition. In Proceedings of the IEEE/CVF international conference on computer vision, pages 367–376, 2021.
  • Qian et al. [2020] Yuyang Qian, Guojun Yin, Lu Sheng, Zixuan Chen, and Jing Shao. Thinking in frequency: Face forgery detection by mining frequency-aware clues. In Proceedings of the European Conference on Computer Vision, pages 86–103. Springer, 2020.
  • Radford et al. [2021] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
  • Ramesh et al. [2021] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In International conference on machine learning, pages 8821–8831. PMLR, 2021.
  • Rombach et al. [2022] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.
  • Rossler et al. [2019] Andreas Rossler, Davide Cozzolino, Luisa Verdoliva, Christian Riess, Justus Thies, and Matthias Nießner. Faceforensics++: Learning to detect manipulated facial images. In Proceedings of the IEEE/CVF international conference on computer vision, pages 1–11, 2019.
  • Schuhmann et al. [2021] Christoph Schuhmann, Richard Vencu, Romain Beaumont, Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo Coombes, Jenia Jitsev, and Aran Komatsuzaki. Laion-400m: Open dataset of clip-filtered 400 million image-text pairs. arXiv preprint arXiv:2111.02114, 2021.
  • Shazeer et al. [2017] Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538, 2017.
  • Shiohara and Yamasaki [2022] Kaede Shiohara and Toshihiko Yamasaki. Detecting deepfakes with self-blended images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18720–18729, 2022.
  • Shuai et al. [2023] Chao Shuai, Jieming Zhong, Shuang Wu, Feng Lin, Zhibo Wang, Zhongjie Ba, Zhenguang Liu, Lorenzo Cavallaro, and Kui Ren. Locate and verify: A two-stream network for improved deepfake detection. In Proceedings of the 31st ACM International Conference on Multimedia, pages 7131–7142, 2023.
  • Tan et al. [2023a] Chuangchuang Tan, Yao Zhao, Shikui Wei, Guanghua Gu, Ping Liu, and Yunchao Wei. Rethinking the up-sampling operations in cnn-based generative network for generalizable deepfake detection. arXiv preprint arXiv:2312.10461, 2023a.
  • Tan et al. [2023b] Chuangchuang Tan, Yao Zhao, Shikui Wei, Guanghua Gu, and Yunchao Wei. Learning on gradients: Generalized artifacts representation for gan-generated images detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12105–12114, 2023b.
  • Tan et al. [2024] Chuangchuang Tan, Renshuai Tao, Huan Liu, and Yao Zhao. Gangen-detection: A dataset generated by gans for generalizable deepfake detection. https://github.com/chuangchuangtan/GANGen-Detection, 2024.
  • Triaridis and Mezaris [2024] Konstantinos Triaridis and Vasileios Mezaris. Exploring multi-modal fusion for image manipulation detection and localization. In International Conference on Multimedia Modeling, pages 198–211. Springer, 2024.
  • Van Der Maaten [2014] Laurens Van Der Maaten. Accelerating t-sne using tree-based algorithms. The journal of machine learning research, 15(1):3221–3245, 2014.
  • Wang et al. [2023a] Qingwei Wang, Jinyu Yang, Xiaosheng Yu, Fangyi Wang, Peng Chen, and Feng Zheng. Depth-aided camouflaged object detection. In Proceedings of the 31st ACM International Conference on Multimedia, pages 3297–3306, 2023a.
  • Wang et al. [2020] Sheng-Yu Wang, Oliver Wang, Richard Zhang, Andrew Owens, and Alexei A Efros. Cnn-generated images are surprisingly easy to spot… for now. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8695–8704, 2020.
  • Wang et al. [2023b] Zhendong Wang, Jianmin Bao, Wengang Zhou, Weilun Wang, Hezhen Hu, Hong Chen, and Houqiang Li. Dire for diffusion-generated image detection. arXiv preprint arXiv:2303.09295, 2023b.
  • Yu et al. [2015] Fisher Yu, Ari Seff, Yinda Zhang, Shuran Song, Thomas Funkhouser, and Jianxiong Xiao. Lsun: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365, 2015.
  • Yuan et al. [2021] Kun Yuan, Shaopeng Guo, Ziwei Liu, Aojun Zhou, Fengwei Yu, and Wei Wu. Incorporating convolution designs into visual transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 579–588, 2021.
  • Zamir et al. [2020] Syed Waqas Zamir, Aditya Arora, Salman Khan, Munawar Hayat, Fahad Shahbaz Khan, Ming-Hsuan Yang, and Ling Shao. Cycleisp: Real image restoration via improved data synthesis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2696–2705, 2020.
  • Zhang et al. [2017] Kai Zhang, Wangmeng Zuo, Yunjin Chen, Deyu Meng, and Lei Zhang. Beyond a gaussian denoiser: Residual learning of deep cnn for image denoising. IEEE transactions on image processing, 26(7):3142–3155, 2017.
  • Zhao et al. [2023] Zixiang Zhao, Haowen Bai, Jiangshe Zhang, Yulun Zhang, Shuang Xu, Zudi Lin, Radu Timofte, and Luc Van Gool. Cddfuse: Correlation-driven dual-branch feature decomposition for multi-modality image fusion. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5906–5916, 2023.
  • Zhong et al. [2023] Nan Zhong, Yiran Xu, Zhenxing Qian, and Xinpeng Zhang. Rich and poor texture contrast: A simple yet effective approach for ai-generated image detection. arXiv preprint arXiv:2311.12397, 2023.
  • Zhou et al. [2016] Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. Learning deep features for discriminative localization. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2921–2929, 2016.
  • Zhu et al. [2017] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE international conference on computer vision, pages 2223–2232, 2017.
  • Zhu et al. [2023] Mingjian Zhu, Hanting Chen, Qiangyu Yan, Xudong Huang, Guanyu Lin, Wei Li, Zhijun Tu, Hailin Hu, Jie Hu, and Yunhe Wang. Genimage: A million-scale benchmark for detecting ai-generated image. arXiv preprint arXiv:2306.08571, 2023.
Supplementary Material

Appendix A Appendix

A.1 More implementation details

Testing datasets. In the main text, we used four datasets, CNNDetectionBenchmark [54], GANGenDetectionBenchmark [50], UniversalFakeDetectBenchmark [36] and AIGCDetectBenchmark [61], to evaluate the generalization of our method across different types of forgeries. The following provides a more detailed description of these datasets:

  • CNNDetectionBenchmark [54]: This dataset includes fake images generated by various GAN methods such as ProGAN [21], StyleGAN [22], StyleGAN2 [23], BigGAN [2], CycleGAN [63], StarGAN [5], GauGAN [37], and DeepFake [43]. It also contains real images randomly selected from six datasets: LSUN [56], ImageNet [7], CelebA [30], CelebA-HQ [21], COCO [25], and FaceForensics++ [43]. This dataset is commonly used in early AIGC detection work.

  • GANGenDetectionBenchmark [50]: To better evaluate the generalization of our detection method on GAN-generated images, we follow [48] and extend our evaluation with images generated by 9 additional GAN models. Each GAN model includes 4K test images, with an equal number of real and fake images.

  • UniversalFakeDetectBenchmark [36]: This dataset includes test sets from diffusion methods such as ADM [8], DALL-E [41], LDM [42], and Glide [35]. Variants of these methods are also considered for LDM and Glide. Real image datasets are drawn from LAION [44] and ImageNet [7].

  • AIGCDetectBenchmark [61]: Similar to CNNDetectionBenchmark, this dataset collects fake images generated by seven GAN-based models and real images from the same sources. Additionally, it incorporates whichfaceisreal (WFIR) and GenImage [64], the latter contributing images from seven diffusion models.

Implementation details. For the LoRA expert modules, we set $\alpha=8$ and $r=4$. As mentioned in the main text, these LoRA experts are trained individually for each type of low-level information. The training steps are consistent with the implementation details in the main text. For the low-level encoders, we also follow the same pre-training setup as in the main text: the extracted features are trained with a classification head and cross-entropy loss so that the features extracted from the low-level information are suited to our classification task. We insert our Cross-Low-level attention layer and Low-level Information Adapter only at the one-quarter, one-half, three-quarters, and final layers of the pre-trained transformer backbone. We will provide the code for reproducing our experiments, and more implementation details can be found in the code.
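To make the two hyperparameters and the insertion schedule concrete, the following is a minimal pure-Python sketch, not the authors' implementation. It assumes the standard LoRA formulation in which a frozen weight W receives a low-rank update scaled by α/r, and it assumes 1-indexed transformer blocks; the function names `lora_forward` and `insertion_layers` are hypothetical.

```python
ALPHA = 8  # LoRA scaling numerator, as stated in the text
RANK = 4   # LoRA rank r, as stated in the text


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_forward(x, w, a, b, alpha=ALPHA, rank=RANK):
    """Return x @ W + (alpha / rank) * x @ A @ B (standard LoRA, assumed here).

    Shapes: x (n, d_in), W (d_in, d_out), A (d_in, rank), B (rank, d_out).
    W is the frozen pre-trained weight; only A and B would be trained.
    """
    scale = alpha / rank
    base = matmul(x, w)
    update = matmul(matmul(x, a), b)
    return [[p + scale * q for p, q in zip(r1, r2)] for r1, r2 in zip(base, update)]


def insertion_layers(depth):
    """1-indexed blocks at one-quarter, one-half, three-quarters, and the end."""
    return [depth * k // 4 for k in (1, 2, 3, 4)]


# Assumed example: a 24-block backbone such as ViT-L.
print(insertion_layers(24))  # [6, 12, 18, 24]
```

With α=8 and r=4 the low-rank update is scaled by 2. When B is initialised to zero, as is common for LoRA, the layer initially reproduces the frozen backbone exactly.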

A.2 More Experimental Results

Generator Image SRM LNP NPR Bayar DnCNN Noiseprint EarlyFusion LateFusion NPR(ResNet50) Ours
ProGAN 99.49 98.38 99.18 100.00 97.15 98.28 99.88 98.51 99.95 99.96 100.00
StyleGAN 89.45 79.00 69.22 96.59 77.85 83.68 82.69 83.99 99.12 97.28 98.35
BigGAN 96.95 82.23 88.33 86.13 70.28 81.40 72.53 75.88 87.78 85.88 94.51
CycleGAN 98.59 50.91 74.11 83.17 84.44 86.45 75.85 66.50 98.05 95.12 97.03
StarGAN 99.57 96.42 99.22 98.05 99.50 95.35 100.00 99.87 99.92 97.32 100.00
GauGAN 97.92 69.78 83.52 84.51 53.59 71.12 52.84 63.47 84.69 97.99 95.19
StyleGAN2 91.71 77.52 73.38 96.53 81.15 79.75 87.18 78.79 96.61 99.56 98.88
whichfaceisreal 83.25 51.95 50.00 70.30 50.00 45.85 90.45 51.95 71.30 50.35 75.71
ADM 77.78 89.61 82.54 68.88 89.47 92.26 79.72 57.19 87.05 71.30 88.43
Glide 84.99 93.58 75.21 86.25 90.14 93.97 74.70 39.67 88.41 94.11 91.53
Midjourney 58.14 51.14 50.59 86.39 50.00 87.23 93.58 55.34 91.33 74.30 91.56
SDv1.4 74.29 50.02 50.20 86.12 50.00 81.24 91.18 56.70 89.83 69.43 93.28
SDv1.5 74.40 49.96 49.96 85.88 50.00 81.29 91.14 56.52 89.96 69.51 93.38
VQDM 85.43 77.27 68.51 69.94 87.79 93.07 87.43 57.25 91.23 80.80 90.94
wukong 77.29 50.04 50.14 78.88 50.00 76.67 89.52 56.87 83.45 61.97 89.46
DALLE2 75.90 92.30 83.05 76.00 94.55 95.00 93.40 33.45 92.40 93.25 93.32
Average 85.32 72.51 71.70 84.60 73.49 83.91 85.13 64.50 90.69 83.63 93.29
Table 6: Detection accuracy comparison between different types of low-level information and fusion methods. Among all detectors, the best result and the second-best result are denoted in boldface and underlined, respectively.

Comparison on testing datasets. The raw experimental data used to plot Fig. 1 and for the analysis in the methods section is presented in Tab. 6. Tab. 7 reports the Acc. and A.P. metrics on CNNDetection. Our method achieves excellent results compared to multiple baseline methods and is comparable to the current state-of-the-art methods NPR [48] and FAFormer [27]. Specifically, our method improves mean Acc. by 3.4 and 0.1 percentage points over NPR and FAFormer, respectively. On StyleGAN, where FAFormer performs poorly, and on BigGAN, where NPR underperforms, our method improves accuracy by 10.7 and 7.0 percentage points, respectively. This indicates that incorporating multiple types of low-level information makes detection performance more uniform across different GAN generation methods. Tab. 8, Tab. 9 and Tab. 10 are the complete versions of Tab. 1, Tab. 3 and Tab. 5 presented in the main text, respectively. They include more baseline method comparisons and additional test results on more datasets. Tab. 11 presents the results of some combinations of low-level information not utilized in the main text, demonstrating that our framework can effectively integrate other low-level information that may possess generalization capabilities.

Method ProGAN StyleGAN StyleGAN2 BigGAN CycleGAN StarGAN GauGAN Deepfake Mean
Acc. A.P. Acc. A.P. Acc. A.P. Acc. A.P. Acc. A.P. Acc. A.P. Acc. A.P. Acc. A.P. Acc. A.P.
CNNDetection 91.4 99.4 63.8 91.4 76.4 97.5 52.9 73.3 72.7 88.6 63.8 90.8 63.9 92.2 51.7 62.3 67.1 86.9
Frank 90.3 85.2 74.5 72.0 73.1 71.4 88.7 86.0 75.5 71.2 99.5 99.5 69.2 77.4 60.7 49.1 78.9 76.5
Durall 81.1 74.4 54.4 52.6 66.8 62.0 60.1 56.3 69.0 64.0 98.1 98.1 61.9 57.4 50.2 50.0 67.7 64.4
Patchfor 97.8 100.0 82.6 93.1 83.6 98.5 64.7 69.5 74.5 87.2 100.0 100.0 57.2 55.4 85.0 93.2 80.7 87.1
F3Net 99.4 100.0 92.6 99.7 88.0 99.8 65.3 69.9 76.4 84.3 100.0 100.0 58.1 56.7 63.5 78.8 80.4 86.2
SelfBlend 58.8 65.2 50.1 47.7 48.6 47.4 51.1 51.9 59.2 65.3 74.5 89.2 59.2 65.5 93.8 99.3 61.9 66.4
GANDetection 82.7 95.1 74.4 92.9 69.9 87.9 76.3 89.9 85.2 95.5 68.8 99.7 61.4 75.8 60.0 83.9 72.3 90.1
BiHPF 90.7 86.2 76.9 75.1 76.2 74.7 84.9 81.7 81.9 78.9 94.4 94.4 69.5 78.1 54.4 54.6 78.6 77.9
FrePGAN 99.0 99.9 80.7 89.6 84.1 98.6 69.2 71.1 71.1 74.4 99.9 100.0 60.3 71.7 70.9 91.9 79.4 87.2
LGrad 99.9 100.0 94.8 99.9 96.0 99.9 82.9 90.7 85.3 94.0 99.6 100.0 72.4 79.3 58.0 67.9 86.1 91.5
UnivFD 99.7 100.0 89.0 98.7 83.9 98.4 90.5 99.1 87.9 99.8 91.4 100.0 89.9 100.0 80.2 90.2 89.1 98.3
NPR 99.8 100.0 96.3 99.8 97.3 100.0 87.5 94.5 95.0 99.5 99.7 100.0 86.6 88.8 77.4 86.2 92.5 96.1
FAFormer 99.8 100.0 87.7 97.4 91.1 99.3 98.9 99.9 99.9 100.0 100.0 100.0 99.9 100.0 89.4 97.3 95.8 99.2
Ours 100.0 100.0 98.4 100.0 98.9 100.0 94.5 98.8 97.0 99.9 100.0 100.0 95.2 98.9 83.4 88.2 95.9 98.2
Table 7: Cross-GAN-Sources Evaluation on the test set of CNNDetection [54]. Partial results from [27, 48].
Generator CNNDet FreDect Fusing GramNet LNP LGrad DIRE-G DIRE-D UnivFD PatchCraft Ours
ProGAN 100.00 99.36 100.00 99.99 99.95 99.83 95.19 52.75 99.81 100.00 100.00
StyleGAN 90.17 78.02 85.20 87.05 92.64 91.08 83.03 51.31 84.93 92.77 98.35
BigGAN 71.17 81.97 77.40 67.33 88.43 85.62 70.12 49.70 95.08 95.80 94.51
CycleGAN 87.62 78.77 87.00 86.07 79.07 86.94 74.19 49.58 98.33 70.17 97.03
StarGAN 94.60 94.62 97.00 95.05 100.00 99.27 95.47 46.72 95.75 99.97 100.00
GauGAN 81.42 80.57 77.00 69.35 79.17 78.46 67.79 51.23 99.47 71.58 95.19
StyleGAN2 86.91 66.19 83.30 87.28 93.82 85.32 75.31 51.72 74.96 89.55 98.88
whichfaceisreal 91.65 50.75 66.80 86.80 50.00 55.70 58.05 53.30 86.90 85.80 75.71
ADM 60.39 63.42 49.00 58.61 83.91 67.15 75.78 98.25 66.87 82.17 88.43
Glide 58.07 54.13 57.20 54.50 83.50 66.11 71.75 92.42 62.46 83.79 91.53
Midjourney 51.39 45.87 52.20 50.02 69.55 65.35 58.01 89.45 56.13 90.12 91.56
SDv1.4 50.57 38.79 51.00 51.70 89.33 63.02 49.74 91.24 63.66 95.38 93.28
SDv1.5 50.53 39.21 51.40 52.16 88.81 63.67 49.83 91.63 63.49 95.30 93.38
VQDM 56.46 77.80 55.10 52.86 85.03 72.99 53.68 91.90 85.31 88.91 90.94
wukong 51.03 40.30 51.70 50.76 86.39 59.55 54.46 90.90 70.93 91.07 89.46
DALLE2 50.45 34.70 52.80 49.25 92.45 65.45 66.48 92.45 50.75 96.60 93.32
Average 69.73 63.28 67.63 68.43 85.28 75.11 67.90 72.70 76.80 89.85 93.29
Table 8: The detection accuracy comparison between our approach and baselines. Among all detectors, the best result and the second-best result are denoted in boldface and underlined, respectively.
Method AttGAN BEGAN CramerGAN InfoMaxGAN MMDGAN RelGAN S3GAN SNGAN STGAN Mean
Acc. A.P. Acc. A.P. Acc. A.P. Acc. A.P. Acc. A.P. Acc. A.P. Acc. A.P. Acc. A.P. Acc. A.P. Acc. A.P.
CNNDet 51.1 83.7 50.2 44.9 81.5 97.5 71.1 94.7 72.9 94.4 53.3 82.1 55.2 66.1 62.7 90.4 63.0 92.7 62.3 82.9
Frank 65.0 74.4 39.4 39.9 31.0 36.0 41.1 41.0 38.4 40.5 69.2 96.2 69.7 81.9 48.4 47.9 25.4 34.0 47.5 54.7
Durall 39.9 38.2 48.2 30.9 60.9 67.2 50.1 51.7 59.5 65.5 80.0 88.2 87.3 97.0 54.8 58.9 62.1 72.5 60.3 63.3
Patchfor 68.0 92.9 97.1 100.0 97.8 99.9 93.6 98.2 97.9 100.0 99.6 100.0 66.8 68.1 97.6 99.8 92.7 99.8 90.1 95.4
F3Net 85.2 94.8 87.1 97.5 89.5 99.8 67.1 83.1 73.7 99.6 98.8 100.0 65.4 70.0 51.6 93.6 60.3 99.9 75.4 93.1
SelfBlend 63.1 66.1 56.4 59.0 75.1 82.4 79.0 82.5 68.6 74.0 73.6 77.8 53.2 53.9 61.6 65.0 61.2 66.7 65.8 69.7
GANDet 57.4 75.1 67.9 100.0 67.8 99.7 67.6 92.4 67.7 99.3 60.9 86.2 69.6 83.5 66.7 90.6 69.6 97.2 66.1 91.6
LGrad 68.6 93.8 69.9 89.2 50.3 54.0 71.1 82.0 57.5 67.3 89.1 99.1 78.5 86.0 78.0 87.4 54.8 68.0 68.6 80.8
UnivFD 78.5 98.3 72.0 98.9 77.6 99.8 77.6 98.9 77.6 99.7 78.2 98.7 85.2 98.1 77.6 98.7 74.2 97.8 77.6 98.8
NPR 83.0 96.2 99.0 99.8 98.7 99.0 94.5 98.3 98.6 99.0 99.6 100.0 79.0 80.0 88.8 97.4 98.0 100.0 93.2 96.6
Ours 86.2 97.8 100.0 100.0 100.0 100.0 98.6 99.9 99.3 99.8 100.0 100.0 83.0 87.0 90.4 98.7 100.0 100.0 95.3 98.1
Table 9: Cross-GAN-Sources Evaluation on GANGenDetection [50]. Partial results from [48].
Method DALLE Glide_100_10 Glide_100_27 Glide_50_27 ADM LDM_100 LDM_200 LDM_200_cfg Mean
Acc. A.P. Acc. A.P. Acc. A.P. Acc. A.P. Acc. A.P. Acc. A.P. Acc. A.P. Acc. A.P. Acc. A.P.
CNNDet 51.8 61.3 53.3 72.9 53.0 71.3 54.2 76.0 54.9 66.6 51.9 63.7 52.0 64.5 51.6 63.1 52.8 67.4
Frank 57.0 62.5 53.6 44.3 50.4 40.8 52.0 42.3 53.4 52.5 56.6 51.3 56.4 50.9 56.5 52.1 54.5 49.6
Durall 55.9 58.0 54.9 52.3 48.9 46.9 51.7 49.9 40.6 42.3 62.0 62.6 61.7 61.7 58.4 58.5 54.3 54.0
Patchfor 79.8 99.1 87.3 99.7 82.8 99.1 84.9 98.8 74.2 81.4 95.8 99.8 95.6 99.9 94.0 99.8 86.8 97.2
F3Net 71.6 79.9 88.3 95.4 87.0 94.5 88.5 95.4 69.2 70.8 74.1 84.0 73.4 83.3 80.7 89.1 79.1 86.5
SelfBlend 52.4 51.6 58.8 63.2 59.4 64.1 64.2 68.3 58.3 63.4 53.0 54.0 52.6 51.9 51.9 52.6 56.3 58.7
GANDet 67.2 83.0 51.2 52.6 51.1 51.9 51.7 53.5 49.6 49.0 54.7 65.8 54.9 65.9 53.8 58.9 54.3 60.1
LGrad 88.5 97.3 89.4 94.9 87.4 93.2 90.7 95.1 86.6 100.0 94.8 99.2 94.2 99.1 95.9 99.2 90.9 97.2
UnivFD 89.5 96.8 90.1 97.0 90.7 97.2 91.1 97.4 75.7 85.1 90.5 97.0 90.2 97.1 77.3 88.6 86.9 94.5
NPR 94.5 99.5 98.2 99.8 97.8 99.7 98.2 99.8 75.8 81.0 99.3 99.9 99.1 99.9 99.0 99.8 95.2 97.4
FAFormer 98.8 99.8 94.2 99.2 94.4 99.1 94.7 99.4 76.1 92.0 98.7 99.9 98.6 99.8 94.9 99.1 93.8 95.5
Ours 97.7 99.7 97.9 99.2 97.3 99.1 98.6 99.9 90.1 96.4 99.5 99.9 98.9 99.3 98.5 99.5 97.3 99.1
Table 10: Cross-Diffusion-Sources Evaluation on the diffusion test set of UniversalFakeDetect [36]. Partial results from [27, 48].
Image SRM LNP Bayar Acc. A.P.
\checkmark 85.3 91.8
\checkmark 72.5 84.4
\checkmark 71.7 83.2
\checkmark 73.5 87.9
\checkmark \checkmark 87.8 92.4
\checkmark \checkmark \checkmark 89.3 93.1
\checkmark \checkmark \checkmark 88.1 92.7
\checkmark \checkmark \checkmark 90.4 93.0
\checkmark \checkmark \checkmark \checkmark 90.7 95.6
Table 11: Detection performance (Acc. and A.P.) for different combinations of the image and low-level information (SRM, LNP, Bayar).
Detector JPEG Downsampling Blur
CNNDetection 64.03 58.85 68.39
FreDect 66.95 35.84 65.75
Fusing 62.43 50.00 68.09
GramNet 65.47 60.30 68.63
LNP 53.56 63.28 65.88
LGrad 51.55 60.86 71.73
DIRE-G 66.49 56.09 64.00
DIRE-D 70.27 62.26 70.46
UnivFD 74.10 70.87 70.31
Patchcraft 72.48 78.36 75.99
Ours 80.52 84.49 83.26
Table 12: Robustness performance (Acc.) of different baselines and our method. The best result and the second-best result are denoted in boldface and underlined, respectively.
Arch Pretrain w/Ours Acc. A.P.
ViT-B ImageNet [7] × 71.7 88.5
ViT-B ImageNet [7] ✓ 85.4 93.6
ViT-L ImageNet [7] × 76.2 89.0
ViT-L ImageNet [7] ✓ 89.7 94.2
ViT-B SAM [24] × 63.3 81.2
ViT-B SAM [24] ✓ 80.1 89.9
ViT-L SAM [24] × 66.6 82.4
ViT-L SAM [24] ✓ 81.1 86.8
ViT-B CLIP [40] × 72.5 85.1
ViT-B CLIP [40] ✓ 86.8 93.6
ViT-L CLIP [40] × 76.8 90.2
ViT-L CLIP [40] ✓ 93.3 98.4
Table 13: Analysis of different architectures and pretraining strategies.

Robustness Tests. In real-world applications, images spread on public platforms may undergo common image processing operations such as JPEG compression. Therefore, it is important to evaluate the detector on distorted images. We adopt three common image distortions: JPEG compression (quality factor QF=95), Gaussian blur ($\sigma=1$), and image downsampling, where each side is scaled by $r=0.5$ so that the image is reduced to a quarter of its original area. Consistent with previous methods [54] and [61], we augment the training set using these distortions and test on the AIGCDetectBenchmark test set processed with the same distortions. The results are presented in Tab. 12. Compared to previous methods, our method achieves better robustness, outperforming PatchCraft, the strongest baseline on average, by 8.04, 6.13, and 7.27 percentage points under JPEG compression, image downsampling, and Gaussian blur, respectively. Fig. 5 visualizes the inputs we use, both the high-level images and the low-level information, before and after these operations. These operations partially affect the low-level information. However, due to our robust training and the combination of high-level images with multiple low-level features, our method's robustness is effectively enhanced.
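As an illustration of two of the three distortions, the sketch below applies a Gaussian blur with σ=1 and a downsampling with r=0.5 to a grayscale image stored as a list of rows. It is a hedged, dependency-free approximation rather than the pipeline used in the experiments: the kernel radius (three standard deviations), the border clamping, and the nearest-neighbour resize are all assumptions, and JPEG compression is omitted because it needs an image codec.

```python
import math


def gaussian_kernel(sigma=1.0, radius=None):
    """Normalized 1-D Gaussian kernel; radius defaults to ceil(3 * sigma)."""
    if radius is None:
        radius = math.ceil(3 * sigma)
    weights = [math.exp(-(i * i) / (2 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def _blur_rows(img, kernel):
    """Convolve every row with the kernel, clamping indices at the borders."""
    radius = len(kernel) // 2
    width = len(img[0])
    return [
        [sum(kernel[k + radius] * row[min(max(x + k, 0), width - 1)]
             for k in range(-radius, radius + 1))
         for x in range(width)]
        for row in img
    ]


def gaussian_blur(img, sigma=1.0):
    """Separable Gaussian blur of a 2-D grayscale image (list of rows)."""
    kernel = gaussian_kernel(sigma)
    blurred = _blur_rows(img, kernel)                  # horizontal pass
    transposed = [list(col) for col in zip(*blurred)]
    blurred_t = _blur_rows(transposed, kernel)         # vertical pass
    return [list(col) for col in zip(*blurred_t)]


def downsample(img, r=0.5):
    """Nearest-neighbour resize that scales each side by r."""
    height, width = len(img), len(img[0])
    new_h, new_w = max(1, int(height * r)), max(1, int(width * r))
    return [
        [img[min(int(y / r), height - 1)][min(int(x / r), width - 1)]
         for x in range(new_w)]
        for y in range(new_h)
    ]


image = [[float(x + y) for x in range(6)] for y in range(4)]
small = downsample(gaussian_blur(image, sigma=1.0), r=0.5)
print(len(small), len(small[0]))  # 2 3
```

Scaling each side of the 4×6 image by 0.5 gives a 2×3 result, a quarter of the original pixel count.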

Transfer to other pretraining methods. To further demonstrate the generality of our proposed method, we analyze its performance when combined with different architectures and pretraining strategies. Tab. 13 shows the Acc. and A.P. metrics for ViT-B and ViT-L backbones pretrained with ImageNet [7], SAM [24], and CLIP [40]. Comparing performance with and without our method shows that incorporating low-level information through our fusion architecture helps under every pretraining framework tested: Acc. rises by 13.5 to 16.8 percentage points across the six configurations, improving the generalization of these backbones for detecting synthetic images.
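The size of the improvement for each backbone can be read off Table 13 by subtracting the row without our method from the row with it. The short script below does this arithmetic; the numbers are copied from Table 13, while the dictionary layout and the `gains` helper are illustrative choices only.

```python
# (architecture, pretraining) -> ((Acc., A.P.) without ours, (Acc., A.P.) with ours),
# values copied from Table 13.
TABLE_13 = {
    ("ViT-B", "ImageNet"): ((71.7, 88.5), (85.4, 93.6)),
    ("ViT-L", "ImageNet"): ((76.2, 89.0), (89.7, 94.2)),
    ("ViT-B", "SAM"): ((63.3, 81.2), (80.1, 89.9)),
    ("ViT-L", "SAM"): ((66.6, 82.4), (81.1, 86.8)),
    ("ViT-B", "CLIP"): ((72.5, 85.1), (86.8, 93.6)),
    ("ViT-L", "CLIP"): ((76.8, 90.2), (93.3, 98.4)),
}


def gains(table):
    """Return (Acc. gain, A.P. gain) in percentage points for each configuration."""
    return {
        key: (round(with_ours[0] - without[0], 1), round(with_ours[1] - without[1], 1))
        for key, (without, with_ours) in table.items()
    }


for (arch, pretrain), (acc_gain, ap_gain) in gains(TABLE_13).items():
    print(f"{arch:5s} {pretrain:8s} Acc. +{acc_gain:.1f}  A.P. +{ap_gain:.1f}")
```

The Acc. gains fall between 13.5 points (ViT-L, ImageNet) and 16.8 points (ViT-B, SAM), and the A.P. gains between 4.4 and 8.7 points.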

A.3 Broader impacts and Limitation

As AI-generated image detection methods continue to evolve, they aim to combat the growing influx of fake information and the constantly updating AIGC technologies. However, these methods may have unintended consequences in the realm of content moderation. Legitimate human-created content that resembles forgeries may be incorrectly identified as AI-generated images, while some highly realistic AI-generated images might be recognized by algorithms as genuine. This could impact the sharing of normal information based on image morphology. Further research and consideration are needed when applying this work to practical applications in content moderation.

Figure 5: The visualization results of the image and low-level information (four panels, (a) to (d)).