
CNC: Cross-modal Normality Constraint for Unsupervised Multi-class Anomaly Detection


Xiaolei Wang1,2,3*, Xiaoyang Wang1,2,3*, Huihui Bai4, Eng Gee Lim1, Jimin Xiao1†
1Xi'an Jiaotong-Liverpool University  2University of Liverpool  3Dinnar Automation Technology  4Beijing Jiaotong University
{Xiaolei.Wang, wangxy}@liverpool.ac.uk, [email protected], {enggee.lim, jimin.xiao}@xjtlu.edu.cn

* These authors contributed equally.
† Corresponding author.
Copyright © 2025, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract

Existing unsupervised distillation-based methods rely on the differences between encoded and decoded features to locate abnormal regions in test images. However, the decoder trained only on normal samples still reconstructs abnormal patch features well, degrading performance. This issue is particularly pronounced in unsupervised multi-class anomaly detection tasks. We attribute this behavior to 'over-generalization' (OG) of the decoder: the significantly increasing diversity of patch patterns in multi-class training enhances the model generalization on normal patches, but also inadvertently broadens its generalization to abnormal patches. To mitigate 'OG', we propose a novel approach that leverages class-agnostic learnable prompts to capture common textual normality across various visual patterns, and then applies them to guide the decoded features towards a 'normal' textual representation, suppressing 'over-generalization' of the decoder on abnormal patterns. To further improve performance, we also introduce a gated mixture-of-experts module to specialize in handling diverse patch patterns and reduce mutual interference between them in multi-class training. Our method achieves competitive performance on the MVTec AD and VisA datasets, demonstrating its effectiveness.

Code — https://github.com/cvddl/CNC

Introduction

Visual anomaly detection (AD) mainly focuses on identifying unexpected patterns (deviating from our familiar normal ones) within samples. Industrial defect detection is one of the most widely used branches of AD (Bergmann et al. 2019), which requires models to automatically recognize various defects on the surface of industrial products, such as scratches, damages, and misplacement. Due to the inability to fully collect and annotate anomalies, unsupervised methods (Yu et al. 2021; Gudovskiy, Ishizaka, and Kozuka 2022; Liu et al. 2023b) have become the mainstream solutions for AD. Previous unsupervised methods mostly train one model for one class of data, which requires large parameter storage and long training time as the number of classes increases. Therefore, UniAD (You et al. 2022) proposed a challenging multi-class AD setting, i.e., training one model to detect anomalies from multiple categories.

Reverse distillation (RD) (Deng and Li 2022) is a highly effective unsupervised AD (UAD) method. It employs a learnable decoder (student network) to reconstruct features from a pre-trained encoder (teacher network) on normal samples via patch-level cosine distance minimization. Ideally, the learned decoder should only recover the encoded normal patches, while failing to reconstruct unseen abnormal patterns. Anomaly regions are then detected by comparing features before and after decoding. In multi-class training, a single model is optimized on the normal samples from multiple classes to achieve unified detection. While the increased training diversity can improve model generalization on reconstructing normal patches, it also leads to undesired generalization to unseen abnormal patch patterns. Consequently, abnormal regions are recovered well during inference, narrowing the difference between encoded and decoded features and degrading detection performance (Fig. 1(B) I.). We term this issue 'over-generalization' (OG). The key question remains: how can we effectively mitigate 'OG' while preserving the generalization on normal samples in the multi-class distillation framework?

To address this challenge, we seek to incorporate an additional constraint in the decoding process. Leveraging insights from vision-language models (VLMs) (Radford et al. 2021), we observe that normal and abnormal regions within a sample exhibit distinct responses to the same text description, as illustrated in Fig. 1(A). We propose to exploit this distinction in cross-modal response to differentiate the decoding of normal and abnormal patch features, thereby hindering the recovery of abnormal patterns. Specifically, we employ class-agnostic learnable prompts to extract the common normality from encoded visual features across different classes. These prompts serve as anchors in the textual space, aligning the decoded normal features with a universal representation of normality and suppressing the 'over-generalization' of the decoder towards abnormal patterns (Fig. 1(B) II.). We also design a normality promotion mechanism for feature distillation, introducing cross-modal activation as a control coefficient on visual features to increase sensitivity to unexpected abnormal patterns. We term the combination of these two strategies Cross-modal Normality Constraint (CNC), which aims to mitigate the 'OG' issue and enhance anomaly localization in multi-class distillation frameworks (see Fig. 1(D) for visualization results).
[Figure 1: (A) correspondence between the two modalities; (B) motivation of the cross-modal constraint; (C) comprehensive comparison on MVTec AD; (D) visualization of the cross-modal constraint.]

Figure 1: (A) and (B) show the correspondence between visual and text modality and the motivation of CNC, respectively. (C)
shows a comprehensive performance comparison with previous SOTA methods that only learn sample normality on the visual
modality on MVTec AD dataset. (D) gives some visualization results of cross-modal constraint on MVTec AD dataset.

Another angle to tackle the 'OG' issue is to mitigate the mutual interference among the different patch patterns produced by the increasing number of categories during feature distillation learning. The success of previous one-model-one-class settings can be attributed to the separate learning of patterns for each class, without interference from patch patterns of other classes. However, in multi-class training, inter-class interference is unavoidable. To address this issue, we propose constructing multiple expert networks to specialize in handling different patch patterns. We find that the mixture-of-experts (MoE) framework (Shazeer et al. 2017; Ma et al. 2018) can selectively process distinct patch patterns, assigning each patch a distinct weighted combination of experts to alleviate the mutual interference. By combining a vanilla RD framework with our CNC and MoE, we achieve performance that surpasses previous methods (Deng and Li 2022; You et al. 2022; He et al. 2024b) across multiple metrics (see Fig. 1(C)).

Our work primarily addresses the inherent 'over-generalization' issue of distillation frameworks in multi-class training. To this end, we propose two key strategies: a cross-modal normality constraint to facilitate visual decoding, and a mixture-of-experts (MoE) module to process diverse patch patterns selectively. We conduct comprehensive experiments to demonstrate the efficacy of our approach, yielding a notable performance gain over single-modal methods. The main contributions of this work are summarized as follows:

• We identify the 'OG' issue in multi-class distillation frameworks and propose a two-pronged solution to address this challenge.
• We introduce a cross-modal normality constraint to guide visual decoding, effectively reducing the effect of 'OG'.
• We design a gated MoE module to selectively handle various patch patterns, mitigating inter-pattern interference and enhancing detection performance.
• Our novel cross-modal distillation framework, built from scratch, achieves competitive performance on the MVTec AD and VisA datasets.

Related Work

Anomaly Detection

Visual anomaly detection contains various settings according to specific engineering requirements, e.g., unsupervised AD (Yi and Yoon 2020; Zou et al. 2022; Gu et al. 2023; Cao, Zhu, and Pang 2023; Liu et al. 2023a; Zhang, Xu, and Zhou 2024), zero- and few-shot AD (Huang et al. 2022; Fang et al. 2023; Lee and Choi 2024), noisy AD (Chen et al. 2022; Jiang et al. 2022), and 3D AD (Gu et al. 2024; Costanzino et al. 2024; Liu et al. 2024; Li et al. 2024a). Existing unsupervised AD methods can be roughly divided into reconstruction-based (Tien et al. 2023; Lu et al. 2023b; He et al. 2024b), feature-embedding-based (McIntosh and Albu 2023; Roth et al. 2022; Lei et al. 2023), and augmentation-based (Zavrtanik, Kristan, and Skočaj 2021; Zhang, Xu, and Zhou 2024; Lin and Yan 2024) methods.

Reconstruction-based Method The reconstruction-based method employs the autoencoder framework to learn the normality of training samples by reconstructing data or its features. Therefore, some works (Zhang et al. 2023; Guo et al. 2024) rethink RD as a reconstruction-based method. Although this type of approach offers fast inference speed, the anomaly localization is inevitably degraded by 'OG', which is attributed to the increasing diversity of patch patterns in multi-class training. To address this issue, we propose CNC and MoE to alleviate the undesired generalization.
[Figure 2: framework diagram showing the learnable prompts, feature-level normality promotion (FNP), multi-layer fusion (MLF), and gated mixture-of-experts (MoE) blocks, together with the Top-K router network and the frozen/learnable components.]

Figure 2: Overview of the proposed cross-modal normality distillation framework. Details of the feature-level normality promotion, multi-layer fusion, and gated mixture-of-experts modules are also illustrated.

Feature-embedding-based Method These methods typically rely on pre-trained networks to extract feature embedding vectors from a feature extractor trained on natural images, then apply density estimation (Defard et al. 2021; Yao et al. 2023), memory banks (Bae, Lee, and Kim 2023), etc., to detect anomalies. However, there is a significant gap between industrial and natural data, and the extracted embeddings may not be suitable for anomaly detection tasks.

Augmentation-based Method Early augmentation methods (Li et al. 2021; Lu et al. 2023a) typically rely on simple handcrafted augmentation techniques such as rotation, translation, and texture pasting to generate pseudo samples for discrimination training. However, a huge gap exists between the synthesized samples and real defect samples. Therefore, some works based on generative models have been proposed recently, such as VAE-based (Lee and Choi 2024), GAN-based (Wang et al. 2023), and diffusion-based (Hu et al. 2024; Zhang, Xu, and Zhou 2024) methods. Due to the incomplete collection of defect shapes and categories, it is impossible to synthesize all types of anomalies.

Preliminaries

CLIP Contrastive Language-Image Pre-training (CLIP) (Radford et al. 2021) is a large-scale vision-language model famous for its multi-modal alignment ability, obtained by training on a large corpus of image-prompt pairs. Specifically, given an unknown image x and text prompts {p_1, p_2, \cdots, p_J}, CLIP can predict the probability of alignment between x and every prompt p_j as follows:

p(y|x) = \frac{\exp(\mathcal{F}(x) \cdot \mathcal{G}(p_y)/\tau)}{\sum_{j=1}^{J} \exp(\mathcal{F}(x) \cdot \mathcal{G}(p_j)/\tau)},   (1)

where \mathcal{F}(\cdot) and \mathcal{G}(\cdot) are the CLIP visual and text encoders, respectively, and \tau is a temperature hyperparameter. Previous work (Jeong et al. 2023) adopts handcrafted prompts, such as "a photo of normal/damaged [class]", to achieve zero-shot anomaly detection.
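To make Eq. (1) concrete, the following is a minimal PyTorch sketch of the alignment probability, assuming the image embedding \mathcal{F}(x) and the J prompt embeddings \mathcal{G}(p_j) have already been computed; the L2 normalization mirrors common CLIP practice and is an assumption here, not something fixed by Eq. (1) itself.

```python
import torch
import torch.nn.functional as F

def clip_alignment_probs(image_feat: torch.Tensor,
                         text_feats: torch.Tensor,
                         tau: float = 0.01) -> torch.Tensor:
    """Eq. (1): softmax over scaled similarities between one image embedding
    F(x) of shape (C,) and J prompt embeddings G(p_j) of shape (J, C)."""
    image_feat = F.normalize(image_feat, dim=-1)   # assumed CLIP-style L2 norm
    text_feats = F.normalize(text_feats, dim=-1)
    logits = text_feats @ image_feat / tau         # (J,)
    return logits.softmax(dim=-1)                  # p(y = j | x)

# toy usage with random embeddings standing in for the CLIP encoder outputs
probs = clip_alignment_probs(torch.randn(512), torch.randn(2, 512))
```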
Prompt Learning Prompt learning (Jia et al. 2022; Zhou et al. 2022) focuses on optimizing the input prompts to enhance the performance of language or multi-modal models on specific tasks. CoOp (Zhou et al. 2022) introduces a learnable prompt, p = [V_1], [V_2], \cdots, [V_M], [class], to achieve few-shot classification, where each [V_m] is a learnable token and M is the number of tokens. However, in the multi-class UAD task, we do not expect to utilize any class information. Following prior works (Zhou et al. 2023; Li et al. 2024b), our learnable prompts are defined as:

p_n = [V_1^n], [V_2^n], \cdots, [V_M^n], [object],   (2)
p_a = [V_1^a], [V_2^a], \cdots, [V_M^a], [damaged], [object],   (3)

where p_n and p_a are the normal and abnormal prompts, respectively. We argue that utilizing category-agnostic prompts enables the learning of common normality patterns across samples from different classes. Different from previous methods, we apply these prompts to learn the normality of training samples and leverage them to guide visual decoding.
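The sketch below illustrates how class-agnostic prompts of the form in Eqs. (2)-(3) can be represented as learnable context tokens; the module and tensor names are hypothetical, and the fixed suffix embeddings ([object], [damaged][object]) are assumed to be looked up from the frozen CLIP token embedding table before being passed, together with the learnable tokens, through the frozen text encoder.

```python
import torch
import torch.nn as nn

class ClassAgnosticPrompts(nn.Module):
    """Sketch of Eqs. (2)-(3): M learnable context tokens followed by fixed
    suffix embeddings; no class names are used anywhere."""
    def __init__(self, M: int = 12, dim: int = 768,
                 suffix_normal: torch.Tensor = None,     # embedding of [object]
                 suffix_abnormal: torch.Tensor = None):  # [damaged][object]
        super().__init__()
        self.ctx_n = nn.Parameter(torch.randn(M, dim) * 0.02)  # [V_m^n]
        self.ctx_a = nn.Parameter(torch.randn(M, dim) * 0.02)  # [V_m^a]
        # zero placeholders if the real CLIP word embeddings are not supplied
        self.register_buffer("suf_n", suffix_normal if suffix_normal is not None
                             else torch.zeros(1, dim))
        self.register_buffer("suf_a", suffix_abnormal if suffix_abnormal is not None
                             else torch.zeros(2, dim))

    def forward(self):
        p_n = torch.cat([self.ctx_n, self.suf_n], dim=0)  # Eq. (2)
        p_a = torch.cat([self.ctx_a, self.suf_a], dim=0)  # Eq. (3)
        return p_n, p_a  # fed to the frozen CLIP text encoder to get g_n, g_a
```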
Methodology

Overview

The framework of the proposed method is shown in Fig. 2. Our method consists of three main parts: (1) Visual Distillation Framework; (2) Cross-modal Normality Constraint; (3) Gated Mixture-of-Experts. In (1), we introduce our basic reverse distillation network. In (2), we propose the cross-modal normality constraint to ensure that decoded features align with a textual 'normal' representation. Additionally, we propose a cross-modal control coefficient on the visual features to improve sensitivity to abnormal patch patterns, which is called feature-level normality promotion. In (3), a gated mixture-of-experts (MoE) module is presented in detail to specifically handle various patch patterns. Finally, the inference phase of our method is given for convenience.

Visual Distillation Framework

Compared with previous single-modal distillation-based methods (Deng and Li 2022; Tien et al. 2023), we select a multi-modal backbone, CLIP-ViT, as the encoder, which means that text-modal information can be adopted to improve detection performance. Specifically, for a given image x \in \mathbb{R}^{H_0 \times W_0 \times 3}, the CLIP visual encoder \mathcal{F} encodes x into multi-layer latent features \{f_i\}_{i=1}^{N}, where f_i \in \mathbb{R}^{H \times W \times C} represents the i-th layer feature (the feature size is consistent across ViT layers), and N is the number of residual attention layers in the ViT (Dosovitskiy et al. 2020). We select the i_1-th, i_2-th, and i_3-th layer features as the visual encoded features. For convenience, let \mathcal{F}_E^1, \mathcal{F}_E^2, \mathcal{F}_E^3 denote the i_1-th, i_2-th, i_3-th blocks respectively, and f_1, f_2, f_3 the corresponding encoded features. For the visual decoder, we adopt three general residual attention blocks (the basic module of the ViT) as the decoder and extract the corresponding features for reconstruction. Therefore, we denote the decoder as \mathcal{F}_D with three blocks \{\mathcal{F}_D^i\}_{i=1}^{3}, and the corresponding decoded features as \{\hat{f}_i\}_{i=1}^{3} (see Fig. 2). In addition, to further alleviate the 'over-generalization', Gaussian noise is applied to the encoded feature f_i to obtain its perturbed version f_i^{noise}.
the decoded feature fbi∗ with cross-modal control coefficient:
Cross-modal Normality Constraint
To alleviate the unexpected ‘over-generalization’ in multi- fbi∗ = fbi ⊕ λΨ(b
αi , βbi ), (8)
class training, we propose a text-modal normality constraint where αbi = fbi ⊗ gni , βbi = fbi ⊗ gai .
to guide the decoded features towards a ‘normal’ textual We call the above step ‘feature-level normality promo-
representation, suppressing the ‘over-generalization’ of the tion’ (FNP), as shown in Fig. 2. Next, we give a new cross-
decoder towards the abnormal direction. The key to our modal distillation loss to ensure the consistency of the en-
proposed cross-modal normality constraint lies in applying coded and decoded features with the corresponding control
learnable category-agnostic prompts to learn common nor- coefficient as follows:
mality from various normal samples and maintain the se- 3 
mantic consistency of visual encoded and decoded features
X Flat(fi∗ ) · Flat(fbi∗ ) 
Ldistill = 1− , (9)
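Eq. (4) is a binary contrastive objective over the {normal, abnormal} prompt pair for each of the three layers; a hedged sketch, assuming e_i and the text features are already available as tensors, is:

```python
import torch

def normality_alignment_loss(e, g_n, g_a, tau=0.001):
    """Eq. (4): align each layer's global visual feature e_i with the 'normal'
    text feature g_n^i against the 'abnormal' one g_a^i.
    e, g_n, g_a: tensors of shape (3, C)."""
    s_n = (e * g_n).sum(dim=-1) / tau
    s_a = (e * g_a).sum(dim=-1) / tau
    pair = torch.stack([s_n, s_a], dim=-1)                 # (3, 2)
    return -(s_n - torch.logsumexp(pair, dim=-1)).sum()    # sum over layers

loss = normality_alignment_loss(torch.randn(3, 512),
                                torch.randn(3, 512), torch.randn(3, 512))
```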
in textual space during the training phase. ∥Flat(fi∗ )∥∥Flat(fbi∗ )∥
i=1
Learning Cross-modal Normality In this section, we aim where Flat(·) is the fatten function.
to apply class-agnostic learnable prompts to learn textu-
ral normality from various encoded features. Specifically, Feature Decoding with Normality Constraint In this
for a given image x, we apply {FEi }3i=1 blocks in en- section, we apply textual features {[gni , gai ]}3i=1 trained
coder to obtain multiple layer features {fi }3i=1 . According by (4) as anchors to guide feature decoding, alleviating un-
to Eq. (2) and Eq. (3), we initialize three sets of learn- expected ‘OG’. Our solution is to constrain the textual rep-
able prompts {[pin , pia ]}3i=1 , where {pin }3i=1 are applied to resentation of the decoded features not to deviate from ‘nor-
learn textual normality from different layer visual encoded mal’ during training, i.e., we also keep class tokens of de-
features. Then, each prompt pair [pin , pia ] is input to CLIP coded features aligning with normal text features {gni }:
text encoder G(·), producing the corresponding text feature 3
X exp(b ei · gni /τ )
[gni , gai ], where gni = G(pin ), gai = G(pia ). Next, we employ L2c = −log , (10)
a modal-alignment optimization object to learn the textual i=1
ei · gni /τ ) + exp(b
exp(b ei · gai /τ )
normality from {fi }3i=1 :
ei represents the global feature of fbi . We combine
where b
3 formulas (4) and (10) to give the cross-modal constraint loss:
X exp(ei · gni /τ )
L1c = −log , (4)  1
exp(ei · gni /τ ) + exp(ei · gai /τ ) Lc if epoch < ϑ
i=1 Lconstraint = , (11)
L1c + γL2c if epoch ≥ ϑ
where ei represents the global feature of fi , τ is a tem-
perature coefficient. Next, textual features {[gni , gai ]}3i=1 are where γ = 0.1 is a hyperparameter. Each text feature gni
adopted in feature distillation and decoding to alleviate un- is a dynamic anchor that is used as a medium to keep the
expected ‘over-generalization’. decoded feature toward the ‘normal’ direction.
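A sketch of the feature-level normality promotion (Eqs. (5)-(8)), the resulting distillation loss (Eq. (9)), and the epoch-gated schedule of Eq. (11) is shown below; the (B, H, W, C) patch-grid layout, the global L2 norm used for \lambda, and the batch reduction are assumptions. The loss \mathcal{L}_c^2 of Eq. (10) has the same form as Eq. (4), applied to the global features of the decoded maps.

```python
import torch
import torch.nn.functional as F

def promote(f, g_n, g_a):
    """Eqs. (5)-(8): add a cross-modal activation map to the patch features.
    f: (B, H, W, C); g_n, g_a: (C,). Returns f* = f + lambda * Psi(alpha, beta)."""
    alpha = torch.einsum('bhwc,c->bhw', f, g_n)          # Eq. (7)
    beta = torch.einsum('bhwc,c->bhw', f, g_a)
    psi = 0.5 * (1.0 + torch.tanh(alpha - beta))          # Eq. (6)
    lam = 1.0 / f.norm()                                  # 1/||f||, global norm assumed
    return f + lam * psi.unsqueeze(-1)                    # Eq. (5) / Eq. (8)

def distill_loss(enc_feats, dec_feats, g_n, g_a):
    """Eq. (9): cosine distance between flattened promoted features, summed
    over the three selected layers and averaged over the batch."""
    loss = 0.0
    for f, f_hat in zip(enc_feats, dec_feats):
        a = promote(f, g_n, g_a).flatten(1)               # Flat(f*)
        b = promote(f_hat, g_n, g_a).flatten(1)           # Flat(f_hat*)
        loss = loss + (1 - F.cosine_similarity(a, b, dim=1)).mean()
    return loss

def constraint_loss(L1, L2, epoch, theta=5, gamma=0.1):
    """Eq. (11): enable the decoded-feature term only after a warm-up period."""
    return L1 if epoch < theta else L1 + gamma * L2
```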
Gated Mixture-of-Experts Module

Multi-Layer Fusion Following previous works (Deng and Li 2022; Tien et al. 2023), features from different layers of the pre-trained encoder are aggregated to improve detection performance. For an input x, we first apply the encoder \mathcal{F}_E to extract the multi-layer features \{f_i\}_{i=1}^{3} and concatenate them as [f_1, f_2, f_3]. We adopt a projection layer \Phi(\cdot) (a linear layer with a dropout block) to map the channel dimension back to C:

\tilde{f} = \Phi([f_1, f_2, f_3]),   (12)

where \tilde{f} \in \mathbb{R}^{H \times W \times C} is the fused feature (see Fig. 2, MLF), which is input to the following MoE module.
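Eq. (12) amounts to channel-wise concatenation followed by a linear projection with dropout; a minimal sketch (token-sequence layout assumed) is:

```python
import torch
import torch.nn as nn

class MultiLayerFusion(nn.Module):
    """Eq. (12): concatenate the three selected layer features along the
    channel dimension and project back to C with a linear layer + dropout."""
    def __init__(self, dim=1024, p_drop=0.1):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(3 * dim, dim), nn.Dropout(p_drop))

    def forward(self, f1, f2, f3):            # each (B, L, C) patch tokens
        return self.proj(torch.cat([f1, f2, f3], dim=-1))   # (B, L, C)
```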
Gated Mixture-of-Experts We employ different expert combinations to handle different patch patterns, reducing the mutual interference between them. Specifically, for a given mini-batch of fused features, we obtain a batch of patch embedding features \{z_r\}_{r=1}^{R}, where R = B \times H \times W. Our goal is to assign different expert combinations to recognize different patch patterns. Therefore, we first use a router network G(\cdot) to assign an expert-correlation score to each patch, i.e.,

H_t = G_t(z_r), \quad t \in \{1, 2, \cdots, T\},   (13)

where G(\cdot): \mathbb{R}^C \mapsto \mathbb{R}^T, T is the number of expert networks \{E_t(\cdot)\}_{t=1}^{T}, and each E_t(\cdot) is implemented as an MLP. Next, we select the experts with the top-K scores \{H_k\}_{k=1}^{K} to handle the patch embedding vector z_r, denoted by \{E_k(\cdot)\}_{k=1}^{K}. We thus obtain a unique combination of processing for each patch embedding, i.e.,

z_r^* = \sum_{k=1}^{K} H_k \cdot E_k(z_r).   (14)
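The gated MoE of Eqs. (13)-(14) can be sketched as follows; the expert MLP width, the loop-based dispatch, and the use of raw (un-normalized) router scores as combination weights are assumptions made for clarity rather than details fixed above.

```python
import torch
import torch.nn as nn

class GatedMoE(nn.Module):
    """Eqs. (13)-(14): a router scores T experts per patch token and only the
    top-K experts (T = 5, K = 2 in the reported setting) process each token."""
    def __init__(self, dim=1024, num_experts=5, k=2, hidden=2048):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)           # G(.)
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
             for _ in range(num_experts)])                   # E_t(.)
        self.k = k

    def forward(self, z):                     # z: (R, C) patch embeddings
        scores = self.router(z)               # H_t, Eq. (13)
        topv, topi = scores.topk(self.k, dim=-1)
        out = torch.zeros_like(z)
        for slot in range(self.k):            # Eq. (14): sum_k H_k * E_k(z_r)
            idx = topi[:, slot]
            w = topv[:, slot].unsqueeze(-1)
            for e, expert in enumerate(self.experts):
                mask = idx == e
                if mask.any():
                    out[mask] += w[mask] * expert(z[mask])
        return out, scores                    # scores are reused by Eq. (15)
```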
To prevent the router from assigning dominantly large weights to a few experts, which can lead to a singular scoring operation, we apply a universal importance loss (Bengio et al. 2015) to optimize the MoE:

\mathcal{L}_{moe} = \frac{\mathrm{SD}(\sum_{r=1}^{R} G(z_r))^2}{(\frac{1}{R}\sum_{r=1}^{R} G(z_r))^2 + \varepsilon},   (15)

where \mathrm{SD}(\cdot) is the standard deviation operation and \varepsilon is added for numerical stability. Finally, according to Eq. (9), (11), and (15), we obtain the total loss to train our model:

\mathcal{L}_{total} = \mathcal{L}_{distill} + \mathcal{L}_{constraint} + \mathcal{L}_{moe}.   (16)
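The importance loss of Eq. (15) penalizes imbalance in the total router score received by each expert; the sketch below follows the standard squared-coefficient-of-variation form, which is one reasonable reading of Eq. (15).

```python
import torch

def importance_loss(scores: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Eq. (15): squared coefficient of variation of the per-expert importance
    (router scores summed over all R patch tokens), discouraging the router
    from collapsing onto a few experts. scores: (R, T)."""
    importance = scores.sum(dim=0)                    # sum_r G(z_r), shape (T,)
    return importance.std() ** 2 / (importance.mean() ** 2 + eps)

# Eq. (16): the final objective simply sums the three terms, e.g.
# total = distill_loss(...) + constraint_loss(...) + importance_loss(scores)
```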
Inference

The inference procedure is consistent with the training phase. We apply the encoder \mathcal{F}_E(\cdot), the learned prompts \{[p_n^i, p_a^i]\}_{i=1}^{3}, the multi-layer fusion \Phi(\cdot), the MoE module, and the decoder \mathcal{F}_D(\cdot) to produce the encoded features \{f_i^*\}_{i=1}^{3} and decoded features \{\hat{f}_i^*\}_{i=1}^{3}. We design a pixel-level anomaly score map as:

S(f_i^*, \hat{f}_i^*) = \sum_{i=1}^{3} \sigma_i\big(1 - d(f_i^*, \hat{f}_i^*)\big),   (17)

where d(\cdot, \cdot) is the pixel-wise cosine similarity and \sigma_i(\cdot) is the upsampling operator that restores each map to the same size as the input image. In addition, the image-level anomaly score is defined as the maximum score of the pixel-level anomaly map.
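Eq. (17) and the image-level score can be sketched as follows, assuming the promoted encoded and decoded features are given as (B, H, W, C) tensors and that \sigma_i is bilinear upsampling to the input resolution:

```python
import torch
import torch.nn.functional as F

def anomaly_maps(enc_feats, dec_feats, out_size=(224, 224)):
    """Eq. (17): per-layer pixel-wise cosine distance between promoted encoded
    and decoded features, upsampled and summed; the image-level score is the
    maximum of the resulting map."""
    score = 0.0
    for f, f_hat in zip(enc_feats, dec_feats):
        d = F.cosine_similarity(f, f_hat, dim=-1)             # (B, H, W)
        m = (1.0 - d).unsqueeze(1)                             # anomaly evidence
        score = score + F.interpolate(m, size=out_size,
                                      mode='bilinear', align_corners=False)
    pixel_map = score.squeeze(1)                               # (B, H0, W0)
    image_score = pixel_map.flatten(1).max(dim=1).values       # image-level score
    return pixel_map, image_score
```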
Experiments

Experimental Setup

Datasets MVTec AD (Bergmann et al. 2019) is the most widely used industrial anomaly detection dataset, containing 15 categories of sub-datasets. The training set consists of 3629 anomaly-free images, and the test set includes 1725 normal and abnormal images. Segmentation masks are provided for anomaly localization evaluation. VisA (Zou et al. 2022) is a challenging AD dataset containing 10821 images across 12 categories of sub-datasets.

Evaluation Metrics Following the prior work (He et al. 2024b), the image-level Area Under the Receiver Operating Characteristic Curve (I-AUROC) and Average Precision (I-mAP) are applied for anomaly classification. Pixel-level AUROC, pixel-level mAP, and AUPRO (Bergmann et al. 2020) are used for anomaly localization.

Implementation Details The implementation is based on PyTorch. The publicly available CLIP model (ViT-L/14@336px) is the backbone of our method. We select the Adam optimizer to train our model and resize each image to 224 × 224. In addition, the length of each learnable text prompt is set to 12, consistent with previous work (Zhou et al. 2023). For both datasets, we set the temperature coefficient \tau = 0.001 and the batch size to 8, with a learning rate of 0.001, to train the whole model. The number of experts is set to 5 with top K = 2 gated scores in the MoE. We set the number of epochs to 250 and 200 for MVTec AD and VisA, respectively, with the same \vartheta = 5. All experiments are conducted on a single NVIDIA Tesla V100 32GB GPU.
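The reported settings can be collected into a small configuration sketch; the dictionary keys are illustrative, while the values are the ones stated in this section:

```python
# Hedged summary of the reported training configuration; key names are
# illustrative, values are taken from the text above.
train_config = {
    "backbone": "CLIP ViT-L/14@336px",
    "optimizer": "Adam",
    "learning_rate": 1e-3,
    "batch_size": 8,
    "image_size": (224, 224),
    "prompt_length_M": 12,
    "temperature_tau": 0.001,
    "num_experts_T": 5,
    "top_k_K": 2,
    "epochs": {"MVTecAD": 250, "VisA": 200},
    "warmup_theta": 5,       # epoch threshold in Eq. (11)
    "gamma": 0.1,            # weight of L_c^2 in Eq. (11)
}
```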
Comprehensive Comparisons with SOTA Methods

In this section, we compare our approach with several SOTA methods on the MVTec AD and VisA datasets, where five different metrics, I-AUROC, P-AUROC, AUPRO, I-mAP, and P-mAP, are reported for comprehensive evaluation in Table 1 and Table 3, respectively.

Results on MVTec AD As reported in Table 1, for the widely used MVTec AD dataset, our cross-modal normality distillation framework (CND) achieves 98.6/98.0/93.0/99.3/56.4 on the five metrics, and a mean performance of 89.0 under the multi-class setting. Compared to UniAD and DiAD, our method improves the five metrics by +2.1/1.2/2.3/0.5/13.0 and +1.4/1.2/2.3/0.3/3.8, and the mean metric by +3.8 and +1.7, respectively. Our method significantly outperforms the single-modal distillation framework RD4AD by +4.0/1.9/1.9/2.8/7.8 in terms of the five metrics. In addition, our method is more stable than previous methods, achieving 93.8+ image- and pixel-level AUROC for all categories, whereas RD4AD merely achieves a 60.8 I-AUROC on Hazelnut and UniAD achieves a 63.1 P-AUROC on Grid. To further illustrate the effectiveness of our proposed method in anomaly localization, we visualize the detection results of UniAD and our method in Fig. 3 (B).
Method −→ RD4AD* (Deng and Li 2022) UniAD* (You et al. 2022) DiAD (He et al. 2024b) CND
Category ↓ CVPR2022 NeurIPS2022 AAAI2024 Ours
Bottle 99.6/96.1/91.1/99.9/48.6 97.0/98.1/93.1/100./66.0 99.7/98.4/86.6/99.5/52.2 100./99.0/97.1/100./81.8
Cable 84.1/85.1/75.1/89.5/26.3 95.2/97.3/86.1/95.9/39.9 94.8/96.8/80.5/98.8/50.1 98.9/98.2/92.5/99.3/64.1
Capsule 94.1/98.8/94.8/96.9/43.4 86.9/98.5/92.1/97.8/42.7 89.0/97.1/87.2/97.5/42.0 98.0/98.2/93.8/99.3/36.9
Hazelnut 60.8/97.9/92.7/69.8/36.2 99.8/98.1/94.1/100./55.2 99.5/98.3/91.5/99.7/79.2 100./98.8/94.9/100./53.3
Metal Nut 100./94.8/91.9/100./55.5 99.2/62.7/81.8/99.9/14.6 99.1/97.3/90.6/96.0/30.0 100./95.5/89.2/100./68.4
Pill 97.5/97.5/95.8/99.6/63.4 93.7/95.0/95.3/98.7/44.0 95.7/95.7/89.0/98.5/46.0 96.8/98.8/96.1/99.5/80.2
Screw 97.7/99.4/96.8/99.3/40.2 87.5/98.3/95.2/96.5/28.7 90.7/97.9/95.0/99.7/60.6 93.8/99.0/94.7/98.3/26.2
Toothbrush 97.2/99.0/92.0/99.0/53.6 94.2/98.4/87.9/97.4/34.9 99.7/99.0/95.0/99.9/78.7 99.5/99.0/93.0/99.8/49.7
Transistor 94.2/85.9/74.7/95.2/42.3 99.8/97.9/93.5/98.0/59.5 99.8/95.1/90.0/99.0/15.6 96.0/94.5/74.1/94.4/57.3
Zipper 99.5/98.5/94.1/99.9/53.9 95.8/96.8/92.6/99.5/40.1 99.3/96.2/91.6/99.8/60.7 99.1/97.6/93.8/99.8/45.1
Carpet 98.5/99.0/95.1/99.6/58.5 99.8/98.5/94.4/99.9/49.9 99.4/98.6/90.6/99.9/42.2 99.9/99.3/97.0/99.9/70.7
Grid 98.0/96.5/97.0/99.4/23.0 98.2/63.1/92.9/99.5/10.7 98.5/96.6/94.0/99.8/66.0 99.1/98.4/95.0/99.7/25.8
Leather 100./99.3/97.4/100./38.0 100./98.8/96.8/100./32.9 99.8/98.8/91.3/99.7/56.1 100./99.5/98.7/100./50.1
Tile 98.3/95.3/85.8/99.3/48.5 99.3/91.8/78.4/99.8/42.1 96.8/92.4/90.7/99.9/65.7 100./97.7/94.1/100./73.4
Wood 99.2/95.3/90.0/99.8/47.8 98.6/93.2/86.7/99.6/37.2 99.7/93.3/97.5/100./43.3 98.3/96.4/91.4/99.5/63.3
Mean 94.6/96.1/91.1/96.5/48.6 96.5/96.8/90.7/98.8/43.4 97.2/96.8/90.7/99.0/52.6 98.6/98.0/93.0/99.3/56.4
mTotal 85.4 85.2 87.3 89.0

Table 1: Comprehensive anomaly detection results with image-level AUROC, pixel-level AUROC, AUPRO, image-level mAP,
and pixel-level mAP metrics on MVTec AD dataset. We also provide the average of all metrics at the bottom of the table. The
best and second-best results are in bold and underlined, respectively. *: The results are sourced from (He et al. 2024a).

[Figure 3: (A) examples on VisA; (B) examples on MVTec AD.]

Figure 3: Visualization of detection results of UniAD and our method on the MVTec AD and VisA datasets.

Results on VisA As reported in Table 3, for the VisA dataset, our proposed method also achieves SOTA performance, namely 93.2 and 98.5 in terms of the I-AUROC and P-AUROC metrics, respectively. Compared to previous multi-class methods, our method outperforms UniAD by +7.7/2.6/15.8/7.1/16.8 and DiAD by +6.4/2.5/16.2/4.3/11.7. Especially in the 'fryum' and 'capsules' categories, our method greatly improves anomaly classification compared to UniAD (You et al. 2022) and DiAD (He et al. 2024b). Finally, we also visualize our localization results via heat maps on VisA in Fig. 3 (A).

Ablation Study

In this section, we investigate the contribution of the main components of our approach. Additionally, we show results for different backbones with different resolutions and investigate the impact of the hyperparameters of the MoE.

MLF CNC MoE I-AUROC/P-AUROC/AUPRO Mean
i ✗ ✗ ✗ 95.4/95.6/90.3 93.7
ii ✓ ✗ ✗ 96.0/96.3/91.1 94.4
iii ✗ ✓ ✗ 96.7/97.0/92.0 95.2
iv ✗ ✗ ✓ 96.3/95.8/90.7 94.2
v ✓ ✓ ✗ 98.1/97.7/92.8 96.2
vi ✓ ✗ ✓ 97.0/96.8/91.4 95.0
vii ✓ ✓ ✓ 98.6/98.0/93.0 96.5

Table 2: Ablation study of our method on MVTec AD. MLF: Multi-Layer Fusion. CNC: Cross-modal Normality Constraint, including feature-level normality promotion, the constraint loss L_constraint, and the distillation loss L_distill. MoE: Gated Mixture-of-Experts Module. Bold/underline values indicate the best/runner-up.

Effectiveness of Main Components In Table 2, we report three main metrics, I-AUROC, P-AUROC, and AUPRO, to study the impact of each key component on MVTec AD. As shown in rows i and iii of Table 2, CNC improves our base model by +1.3/1.4/1.7. In addition, equipped with the MLF module, we obtain a gain of +2.1/1.4/1.7 via CNC (see ii and v in Table 2). Therefore, our proposed CNC successfully suppresses the undesired 'OG'. In addition, the proposed MoE module alleviates 'OG' by assigning different weights to different patch patterns, improving performance by +1.0/0.5 in terms of I-AUROC and P-AUROC (see ii and vi). We also find that MoE enhances image-level anomaly detection, improving the I-AUROC metric by +0.5 (see v and vii in Table 2). Finally, our multi-layer fusion module fuses the information of different ViT layers well. As shown in rows i and ii of Table 2, MLF improves the three metrics by +0.6/0.7/0.8.
Method −→ RD4AD* (Deng and Li 2022) UniAD* (You et al. 2022) DiAD* (He et al. 2024b) CND
Category ↓ CVPR2022 NeurIPS2022 AAAI2024 Ours
pcb1 96.2/99.4/95.8/95.5/66.2 92.8/93.3/64.1/92.7/ 3.9 88.1/98.7/80.2/88.7/49.6 94.1/99.5/92.6/91.7/70.7
pcb2 97.8/98.0/90.8/97.8/22.3 87.8/93.9/66.9/87.7/ 4.2 91.4/95.2/67.0/91.4/ 7.5 95.9/98.4/88.8/92.2/18.1
pcb3 96.4/97.9/93.9/96.2/26.2 78.6/97.3/70.6/78.6/13.8 86.2/96.7/68.9/87.6/ 8.0 92.0/98.6/93.7/93.9/ 21.7
pcb4 99.9/97.8/88.7/99.9/31.4 98.8/94.9/72.3/98.8/14.7 99.6/97.0/85.0/99.5/17.6 99.9/99.0/90.5/99.8/ 40.5
macaroni1 75.9/99.4/95.3/61.5/ 2.9 79.9/97.4/84.0/79.8/ 3.7 85.7/94.1/68.5/85.2/10.2 86.7/98.6/90.5/84.4/ 7.8
macaroni2 88.3/99.7/97.4/84.5/13.2 71.6/95.2/76.6/71.6/ 0.9 62.5/93.6/73.1/57.4/ 0.9 84.4/98.1/93.6/81.3/12.7
capsules 82.2/99.4/93.1/90.4/60.4 55.6/88.7/43.7/55.6/ 3.0 58.2/97.3/77.9/69.0/10.0 83.4/98.4/88.6/89.2/33.6
candle 92.3/99.1/94.9/92.9/25.3 94.1/98.5/91.6/94.0/17.6 92.8/97.3/89.4/92.0/12.8 93.7/98.4/91.9/90.0/16.7
cashew 92.0/91.7/86.2/95.8/44.2 92.8/98.6/87.9/92.8/51.7 91.5/90.9/61.8/95.7/53.1 94.1/98.1/87.4/92.8/62.9
chewinggum 94.9/98.7/76.9/97.5/59.9 96.3/98.8/81.3/96.2/54.9 99.1/94.7/59.5/99.5/11.9 98.7/99.1/89.4/99.2/61.3
fryum 95.3/97.0/93.4/97.9/47.6 83.0/95.9/76.2/83.0/34.0 89.8/97.6/81.3/95.0/58.6 96.4/97.0/92.1/97.9/47.3
pipe-fryum 97.9/99.1/95.4/98.9/56.8 94.7/98.9/91.5/94.7/50.2 96.2/99.4/89.9/98.1/72.7 98.9/98.4/97.5/99.0/61.4
Mean 92.4/98.1/91.8/92.4/38.0 85.5/95.9/75.6/85.5/21.0 86.8/96.0/75.2/88.3/26.1 93.2/98.5/91.4/92.6/37.8
mTotal 82.5 72.7 74.5 82.7

Table 3: Comprehensive anomaly detection results with five different metrics on VisA dataset. Bold/underline values indicate
the best/runner-up. *: The results are sourced from (He et al. 2024a).

Choices of Pre-trained Encoders and Image Resolutions Fig. 4 shows the results of four pre-trained encoders at four different resolutions. On the one hand, we find that models that perform well in zero-shot classification also achieve higher performance in our framework. We obtain the highest performance of 89.0 when applying ViT-L/14*, whereas performance degrades rapidly when using ViT-B/32 or ViT-B/16. On the other hand, we find that both low and high resolutions (128 × 128 and 512 × 512) degrade detection performance, and strong performance is achieved with resolutions of 224 × 224 and 256 × 256, which is consistent with previous work (Deng and Li 2022).

[Figure 4: bar chart over the encoders ViT-B/32, ViT-B/16, ViT-L/14, and ViT-L/14* at resolutions 128×128, 224×224, 256×256, and 512×512.]

Figure 4: Choices of four pre-trained teacher networks (encoders) with four different resolutions. The vertical axis represents the average value of I-AUROC/P-AUROC/AUPRO/I-mAP/P-mAP. ViT-L/14* denotes the pre-trained CLIP model ViT-L/14@336px.

No. Experts ↓ / Top K → K=1 K=2 K=3 K=4
None 98.1/97.7 ✗ ✗ ✗
T=1 98.0/97.7 ✗ ✗ ✗
T=2 97.4/96.7 97.2/96.9 ✗ ✗
T=3 97.4/97.0 98.0/97.3 97.5/97.5 ✗
T=4 97.7/97.2 98.2/97.4 97.9/97.8 97.3/97.6
T=5 97.6/97.4 98.6/98.0 98.2/97.7 98.0/97.7
T=6 97.3/97.3 98.5/97.3 97.8/97.7 97.4/97.4

Table 4: Impact of the hyperparameters of the MoE module, where I-AUROC and P-AUROC metrics are reported on the MVTec AD dataset. T and K denote the number of experts and the top-K coefficient respectively, with K ≤ T. Bold/underline values indicate the best/runner-up.

Impact of Hyperparameters in MoE According to Table 4, an appropriate selection of hyperparameters greatly improves anomaly localization and classification for our method. When T = 1, the module is equivalent to attaching an adapter and does not significantly affect performance. Both large and small values of T can degrade performance: we consider that a large T may lead to some experts under-fitting, while a small T may result in some experts over-fitting. When T = 5 and K = 2, the best performance is achieved.

Conclusion

In this paper, we propose a cross-modal distillation framework to address the inevitable 'over-generalization' in multi-class training. Firstly, we propose the cross-modal normality constraint (CNC) to align the decoded features with a textual representation of normality, thereby improving the normality of the distilled features and the final detection performance. We also propose a gated MoE module to re-weight different patch patterns, reducing the mutual interference between them. Finally, extensive experiments show that our method achieves competitive performance on the MVTec AD and VisA datasets.
Acknowledgments

This work was supported by the National Natural Science Foundation of China (No. 62471405, 62331003, 62301451), Jiangsu Basic Research Program Natural Science Foundation (SBK2024021981), Suzhou Basic Research Program (SYG202316) and XJTLU REF-22-01-010, XJTLU AI University Research Centre, Jiangsu Province Engineering Research Centre of Data Science and Cognitive Computation at XJTLU and SIP AI innovation platform (YZCXPT2022103).

References

Bae, J.; Lee, J.-H.; and Kim, S. 2023. PNI: Industrial anomaly detection using position and neighborhood information. In ICCV.

Bengio, E.; Bacon, P.-L.; Pineau, J.; and Precup, D. 2015. Conditional computation in neural networks for faster models. arXiv preprint arXiv:1511.06297.

Bergmann, P.; Fauser, M.; Sattlegger, D.; and Steger, C. 2019. MVTec AD–A comprehensive real-world dataset for unsupervised anomaly detection. In CVPR.

Bergmann, P.; Fauser, M.; Sattlegger, D.; and Steger, C. 2020. Uninformed students: Student-teacher anomaly detection with discriminative latent embeddings. In CVPR.

Cao, T.; Zhu, J.; and Pang, G. 2023. Anomaly detection under distribution shift. In ICCV.

Chen, Y.; Tian, Y.; Pang, G.; and Carneiro, G. 2022. Deep one-class classification via interpolated gaussian descriptor. In AAAI.

Costanzino, A.; Ramirez, P. Z.; Lisanti, G.; and Di Stefano, L. 2024. Multimodal industrial anomaly detection by cross-modal feature mapping. In CVPR.

Defard, T.; Setkov, A.; Loesch, A.; and Audigier, R. 2021. Padim: A patch distribution modeling framework for anomaly detection and localization. In ICPR.

Deng, H.; and Li, X. 2022. Anomaly detection via reverse distillation from one-class embedding. In CVPR.

Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.

Fang, Z.; Wang, X.; Li, H.; Liu, J.; Hu, Q.; and Xiao, J. 2023. Fastrecon: Few-shot industrial anomaly detection via fast feature reconstruction. In ICCV.

Gu, Z.; Liu, L.; Chen, X.; Yi, R.; Zhang, J.; Wang, Y.; Wang, C.; Shu, A.; Jiang, G.; and Ma, L. 2023. Remembering normality: Memory-guided knowledge distillation for unsupervised anomaly detection. In CVPR.

Gu, Z.; Zhang, J.; Liu, L.; Chen, X.; Peng, J.; Gan, Z.; Jiang, G.; Shu, A.; Wang, Y.; and Ma, L. 2024. Rethinking Reverse Distillation for Multi-Modal Anomaly Detection. In AAAI.

Gudovskiy, D.; Ishizaka, S.; and Kozuka, K. 2022. Cflow-ad: Real-time unsupervised anomaly detection with localization via conditional normalizing flows. In WACV.

Guo, J.; Jia, L.; Zhang, W.; Li, H.; et al. 2024. Recontrast: Domain-specific anomaly detection via contrastive reconstruction. In NeurIPS.

He, H.; Bai, Y.; Zhang, J.; He, Q.; Chen, H.; Gan, Z.; Wang, C.; Li, X.; Tian, G.; and Xie, L. 2024a. Mambaad: Exploring state space models for multi-class unsupervised anomaly detection. arXiv preprint arXiv:2404.06564.

He, H.; Zhang, J.; Chen, H.; Chen, X.; Li, Z.; Chen, X.; Wang, Y.; Wang, C.; and Xie, L. 2024b. A diffusion-based framework for multi-class anomaly detection. In AAAI.

Hu, T.; Zhang, J.; Yi, R.; Du, Y.; Chen, X.; Liu, L.; Wang, Y.; and Wang, C. 2024. Anomalydiffusion: Few-shot anomaly image generation with diffusion model. In AAAI.

Huang, C.; Guan, H.; Jiang, A.; Zhang, Y.; Spratling, M.; and Wang, Y.-F. 2022. Registration based few-shot anomaly detection. In ECCV.

Jeong, J.; Zou, Y.; Kim, T.; Zhang, D.; Ravichandran, A.; and Dabeer, O. 2023. Winclip: Zero-/few-shot anomaly classification and segmentation. In CVPR.

Jia, M.; Tang, L.; Chen, B.-C.; Cardie, C.; Belongie, S.; Hariharan, B.; and Lim, S.-N. 2022. Visual prompt tuning. In ECCV.

Jiang, X.; Liu, J.; Wang, J.; Nie, Q.; Wu, K.; Liu, Y.; Wang, C.; and Zheng, F. 2022. Softpatch: Unsupervised anomaly detection with noisy data. In NeurIPS.

Lee, M.; and Choi, J. 2024. Text-guided variational image generation for industrial anomaly detection and segmentation. In CVPR.

Lei, J.; Hu, X.; Wang, Y.; and Liu, D. 2023. Pyramidflow: High-resolution defect contrastive localization using pyramid normalizing flow. In CVPR.

Li, C.-L.; Sohn, K.; Yoon, J.; and Pfister, T. 2021. Cutpaste: Self-supervised learning for anomaly detection and localization. In CVPR.

Li, W.; Xu, X.; Gu, Y.; Zheng, B.; Gao, S.; and Wu, Y. 2024a. Towards Scalable 3D Anomaly Detection and Localization: A Benchmark via 3D Anomaly Synthesis and A Self-Supervised Learning Network. In CVPR.

Li, Y.; Goodge, A.; Liu, F.; and Foo, C.-S. 2024b. PromptAD: Zero-shot anomaly detection using text prompts. In CVPR.

Lin, J.; and Yan, Y. 2024. A Comprehensive Augmentation Framework for Anomaly Detection. In AAAI.

Liu, J.; Xie, G.; Chen, R.; Li, X.; Wang, J.; Liu, Y.; Wang, C.; and Zheng, F. 2024. Real3d-ad: A dataset of point cloud anomaly detection. In NeurIPS.

Liu, W.; Chang, H.; Ma, B.; Shan, S.; and Chen, X. 2023a. Diversity-measurable anomaly detection. In CVPR.

Liu, Z.; Zhou, Y.; Xu, Y.; and Wang, Z. 2023b. Simplenet: A simple network for image anomaly detection and localization. In CVPR.

Lu, F.; Yao, X.; Fu, C.-W.; and Jia, J. 2023a. Removing anomalies as noises for industrial defect localization. In ICCV.

Lu, R.; Wu, Y.; Tian, L.; Wang, D.; Chen, B.; Liu, X.; and Hu, R. 2023b. Hierarchical vector quantized transformer for multi-class unsupervised anomaly detection. In NeurIPS.

Ma, J.; Zhao, Z.; Yi, X.; Chen, J.; Hong, L.; and Chi, E. H. 2018. Modeling task relationships in multi-task learning with multi-gate mixture-of-experts. In KDD.

McIntosh, D.; and Albu, A. B. 2023. Inter-realization channels: Unsupervised anomaly detection beyond one-class classification. In CVPR.

Radford, A.; Kim, J. W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. 2021. Learning transferable visual models from natural language supervision. In ICML.

Roth, K.; Pemula, L.; Zepeda, J.; Schölkopf, B.; Brox, T.; and Gehler, P. 2022. Towards total recall in industrial anomaly detection. In CVPR.

Shazeer, N.; Mirhoseini, A.; Maziarz, K.; Davis, A.; Le, Q.; Hinton, G.; and Dean, J. 2017. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538.

Tien, T. D.; Nguyen, A. T.; Tran, N. H.; Huy, T. D.; Duong, S.; Nguyen, C. D. T.; and Truong, S. Q. 2023. Revisiting reverse distillation for anomaly detection. In CVPR.

Wang, R.; Hoppe, S.; Monari, E.; and Huber, M. F. 2023. Defect transfer gan: Diverse defect synthesis for data augmentation. arXiv preprint arXiv:2302.08366.

Yao, X.; Li, R.; Zhang, J.; Sun, J.; and Zhang, C. 2023. Explicit boundary guided semi-push-pull contrastive learning for supervised anomaly detection. In CVPR.

Yi, J.; and Yoon, S. 2020. Patch svdd: Patch-level svdd for anomaly detection and segmentation. In ACCV.

You, Z.; Cui, L.; Shen, Y.; Yang, K.; Lu, X.; Zheng, Y.; and Le, X. 2022. A unified model for multi-class anomaly detection. In NeurIPS.

Yu, J.; Zheng, Y.; Wang, X.; Li, W.; Wu, Y.; Zhao, R.; and Wu, L. 2021. Fastflow: Unsupervised anomaly detection and localization via 2d normalizing flows. arXiv preprint arXiv:2111.07677.

Zavrtanik, V.; Kristan, M.; and Skočaj, D. 2021. Draem-a discriminatively trained reconstruction embedding for surface anomaly detection. In ICCV.

Zhang, J.; Chen, X.; Wang, Y.; Wang, C.; Liu, Y.; Li, X.; Yang, M.-H.; and Tao, D. 2023. Exploring plain vit reconstruction for multi-class unsupervised anomaly detection. arXiv preprint arXiv:2312.07495.

Zhang, X.; Xu, M.; and Zhou, X. 2024. RealNet: A feature selection network with realistic synthetic anomaly for anomaly detection. In CVPR.

Zhou, K.; Yang, J.; Loy, C. C.; and Liu, Z. 2022. Learning to prompt for vision-language models. IJCV, 130(9): 2337–2348.

Zhou, Q.; Pang, G.; Tian, Y.; He, S.; and Chen, J. 2023. Anomalyclip: Object-agnostic prompt learning for zero-shot anomaly detection. arXiv preprint arXiv:2310.18961.

Zou, Y.; Jeong, J.; Pemula, L.; Zhang, D.; and Dabeer, O. 2022. Spot-the-difference self-supervised pre-training for anomaly detection and segmentation. In ECCV.
