DROCC: Deep Robust One-Class Classification
Sachin Goyal (Microsoft Research India), Aditi Raghunathan (Stanford University)*, Moksh Jain (NITK Surathkal)*, Harsha Simhadri (Microsoft Research India), Prateek Jain (Microsoft Research India)

*Part of the work was done while interning at Microsoft Research India. Correspondence to: Prateek Jain <prajain@[Link]>.

Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s).

Abstract

Classical approaches for one-class problems require careful feature engineering when applied to structured domains like images. State-of-the-art methods aim to leverage deep learning to learn appropriate features via two main approaches. The first approach, based on predicting transformations (Golan & El-Yaniv, 2018; Hendrycks et al., 2019a), while successful in some domains, crucially depends on an appropriate domain-specific set of transformations that are hard to obtain in general. The second approach, minimizing a classical one-class loss on the learned final-layer representations, e.g., DeepSVDD (Ruff et al., 2018), suffers from the fundamental drawback of representation collapse. In this work, we propose Deep Robust One-Class Classification (DROCC) that is both applicable to most standard domains without requiring any side-information and robust to representation collapse. DROCC is based on the assumption that the points from the class of interest lie on a well-sampled, locally linear low-dimensional manifold. Empirical evaluation demonstrates that DROCC is highly effective in two different one-class problem settings and on a range of real-world datasets across different domains: tabular data, images (CIFAR and ImageNet), audio, and time-series, offering up to 20% increase in accuracy over the state-of-the-art in anomaly detection. Code is available at https://[Link]/microsoft/EdgeML.

1. Introduction

In this work, we study "one-class" classification, where the goal is to obtain accurate discriminators for a special class. Anomaly detection, where points that do not belong to the typical class must be flagged, is the canonical problem in this framework. A related setting under this framework is classification from limited negative training instances, where we require a low false positive rate at test time even over close negatives. This is common in AI systems such as wake-word¹ detection, where the wake-words form the positive or special class, and for safe operation in the real world, the system should not fire on inputs that are close but not identical to the wake-words, no matter how the training data was sampled.

¹An audio or visual cue that triggers some action from the system.

Anomaly detection is a well-studied problem with a large body of research (Aggarwal, 2016; Chandola et al., 2009). Classical approaches for anomaly detection are based on modeling the typical data using simple functions over the inputs (Schölkopf et al., 1999; Liu et al., 2008; Lakhina et al., 2004), such as constructing a minimum-enclosing ball around the typical data points (Tax & Duin, 2004). While these techniques are well-suited when the input is featurized appropriately, they struggle on complex domains like vision and speech, where hand-designing features is difficult.

In contrast, deep learning based anomaly detection methods attempt to automatically learn features, e.g., using CNNs in vision (Ruff et al., 2018). However, current approaches to do so have fundamental limitations. One family of approaches is based on extending the classical data modeling techniques over the learned representations. However, learning these representations jointly with the data modeling layer might lead to degenerate solutions where all the points are mapped to a single point (like the origin), so that the data modeling layer can perfectly "fit" the typical data. Recent works like (Ruff et al., 2018) have proposed heuristics to mitigate this, such as setting the bias to zero, but such heuristics are often insufficient in practice (Table 1). The second line of work (Golan & El-Yaniv, 2018; Bergman & Hoshen, 2020; Hendrycks et al., 2019b) is based on learning the salient geometric structure of the typical data (e.g., orientation of the object) by applying specific transformations (e.g., rotations and flips) to the input data and training the discriminator to predict the applied transformation. If the discriminator fails to predict the transform accurately, the input does not have the same orientation as typical data and is considered
anomalous. In order to be successful, these works critically rely on side-information in the form of appropriate structure/transformations, which is difficult to define in general, especially for domains like time-series, speech, etc. Even for images, if the normal data has been captured from multiple orientations, it is difficult to find appropriate transformations. The last set of deep anomaly detection techniques use generative models such as autoencoders or generative adversarial networks (GANs) (Schlegl et al., 2017a) to learn to generate the entire typical data distribution, which can be challenging and inaccurate in practice (Table 1).

In this paper, we propose a novel Deep Robust One-Class Classification (DROCC) method for anomaly detection that attempts to address the drawbacks of previous methods detailed above. DROCC is robust to representation collapse by involving a discriminative component that is general and is empirically accurate on most standard domains like tabular, time-series, and vision, without requiring any additional side-information. DROCC is motivated by the key observation that generally, the typical data lies on a low-dimensional manifold, which is well-sampled in the training data. This is believed to be true even in complex domains such as vision, speech, and natural language (Pless & Souvenir, 2009). As manifolds resemble Euclidean space locally, our discriminative component is based on classifying a point as anomalous if it is outside the union of small ℓ2 balls around the training typical points (see Figure 1a for an illustration). Importantly, this definition allows us to synthetically generate anomalous points, and we adaptively generate the most effective anomalous points while training via a gradient ascent phase reminiscent of adversarial training. In other words, DROCC has a gradient ascent phase to adaptively add anomalous points to our training set and a gradient descent phase to minimize the classification loss by learning a representation and a classifier on top of the representations to separate typical points from the generated anomalous points. In this way, DROCC automatically learns an appropriate representation (like DeepSVDD) but is robust to representation collapse, as mapping all points to the same value would lead to poor discrimination between normal points and the generated anomalous points.

Next, we study a critical problem similar in flavor to anomaly detection and outlier exposure (Hendrycks et al., 2019a), which we refer to as One-class Classification with Limited Negatives (OCLN). The goal of OCLN is to design a one-class classifier for a positive class with only limited negative instances: the space of negatives is huge and is not well-sampled by the training points. The OCLN classifier should have a low FPR against an arbitrary distribution of negatives (or the uninteresting class) while still ensuring accurate prediction for positives. For example, consider audio wake-word detection, where the goal is to identify a certain word, say Marvin, in a given speech stream. For training, we collect negative instances where Marvin has not been uttered. Standard classification methods tend to identify simple patterns for classification, often relying only on some substring of Marvin, say Mar. While such a classifier has good accuracy on the training set, in practice it can have a high FPR, as the classifier will mis-fire on utterances like Marvelous or Martha. This exact setting has been relatively less well-studied, and there is no benchmark to evaluate methods. Existing work suggests simply expanding the training data to include false positives found after the model is deployed, which is expensive and oftentimes infeasible or unsafe in real applications.

In contrast, we propose DROCC–LF, an outlier-exposure style extension of DROCC. Intuitively, DROCC–LF combines DROCC's anomaly detection loss (which is over only the positive data points) with a standard classification loss over the negative data. In addition, DROCC–LF exploits the negative examples to learn a Mahalanobis distance to compare points over the manifold instead of using the standard Euclidean distance, which can be inaccurate for high-dimensional data with relatively few samples.

We apply DROCC to standard benchmarks from multiple domains such as vision, audio, time-series, and tabular data, and empirically observe that DROCC is indeed successful at modeling the positive (typical) class across all the above-mentioned domains and can significantly outperform baselines. For example, when applied to the anomaly detection task on the benchmark CIFAR-10 dataset, our method can be up to 20% more accurate than baselines like DeepSVDD (Ruff et al., 2018), Autoencoder (Sakurada & Yairi, 2014), and GAN based methods (Nguyen et al., 2019). Similarly, for tabular data benchmarks like Arrhythmia, DROCC can be ≥ 18% more accurate than state-of-the-art methods (Bergman & Hoshen, 2020; Zong et al., 2018). Finally, for the OCLN problem, our method can be up to 10% more accurate than standard baselines.

In summary, the paper makes the following contributions:

• We propose the DROCC method, which is based on a low-dimensional manifold assumption on the positive class, using which it synthetically and adaptively generates negative instances to provide a general and robust approach to anomaly detection.
• We extend DROCC to a one-class classification problem where low FPR on arbitrary negatives is crucial. We also provide an experimental setup to evaluate different methods for this important but relatively less studied problem.
• Finally, we experiment with DROCC on a wide range of datasets across different domains (image, audio, time-series) and demonstrate the effectiveness of our method compared to baselines.
2. Related Work

Anomaly Detection (AD) has been extensively studied owing to its wide applicability (Chandola et al., 2009; Goldstein & Uchida, 2016; Aggarwal, 2016). Classical techniques use simple functions, such as modeling normal points via a low-dimensional subspace or a tree-structured partition of the input space, to detect anomalies (Schölkopf et al., 1999; Tax & Duin, 2004; Liu et al., 2008; Lakhina et al., 2004; Gu et al., 2019). In contrast, deep AD methods attempt to learn appropriate features, while also learning how to model the typical data points using these features. They broadly fall into the three categories discussed below.

AD via generative modeling: Deep autoencoders as well as GAN based methods have been studied extensively (Malhotra et al., 2016; Sakurada & Yairi, 2014; Nguyen et al., 2019; Li et al., 2018; Schlegl et al., 2017b). However, these methods solve a harder problem, as they require reconstructing the entire input from its low-dimensional representation during the decoding step. In contrast, DROCC directly addresses the goal of only identifying whether a given point lies somewhere on the manifold, and hence tends to be more accurate in practice (see Tables 1, 2, 3).

Deep one-class SVM: Deep SVDD (Ruff et al., 2018) introduced the first deep one-class classification objective for anomaly detection, but suffers from the representation collapse issue (see Section 1). In contrast, DROCC is robust to such collapse, since the training objective requires representations that allow accurate discrimination between typical data points and their perturbations that are off the manifold of the typical data points.

Transformation based methods: Recently, (Golan & El-Yaniv, 2018; Hendrycks et al., 2019b) proposed another approach to AD based on self-supervision. The training procedure involves applying different transformations to the typical points and training a classifier to identify the transform applied. The key assumption is that a point is normal iff the transformations applied to the point can be correctly identified, i.e., normal points conform to a specific structure captured by the transformations. (Golan & El-Yaniv, 2018; Hendrycks et al., 2019b) applied the method to vision datasets and proposed using rotations, flips, etc. as the transformations. (Bergman & Hoshen, 2020) generalized the method to tabular data by using handcrafted affine transforms. Naturally, the transformations required by these methods are heavily domain dependent and are hard to design for domains like time-series. Furthermore, even for vision tasks, the suitability of a transformation varies based on the structure of the typical points. For example, as discussed in (Golan & El-Yaniv, 2018), horizontal flips perform well when the typical points are from class '3' of MNIST (AUROC 0.957) but perform poorly when the typical points are from class '8' (AUROC 0.646). In contrast, the low-dimensional manifold assumption that motivates DROCC is generic and seems to hold across several domains like images, speech, etc. For example, DROCC obtains an AUROC of ∼0.97 on both typical class '8' and typical class '3' in MNIST. (See Section 5 for more comparison with self-supervision based techniques.)

Side-information based AD: Recently, several AD methods that explicitly incorporate side-information have been proposed. (Hendrycks et al., 2019a) leverages access to a few out-of-distribution samples; (Ruff et al., 2020) explores the semi-supervised setting where a small set of labeled anomalous examples is available. We view these approaches as complementary to DROCC, which does not assume any side-information. Finally, the OCLN problem is generally modeled as a binary classification problem, but an outlier exposure (OE) style formulation (Hendrycks et al., 2019b) can be used to combine it with anomaly detection methods. Our method DROCC–LF builds upon the OE approach but exploits the "outliers" in a more integrated manner.

3. Anomaly Detection

Let $S \subseteq \mathbb{R}^d$ denote the set of typical, i.e., non-anomalous data points. A point $x \in \mathbb{R}^d$ is anomalous or atypical if $x \notin S$. Suppose we are given $n$ samples $D = [x_i]_{i=1}^{n} \in \mathbb{R}^d$ as training data, where $D_S = \{x_i \mid x_i \in S\}$ is the set of typical points sampled in the training data and $|D_S| \ge (1-\nu)|D|$, i.e., a $\nu \ll 1$ fraction of points in $D$ are anomalies. Then, the goal in unsupervised anomaly detection (AD) is to learn a function $f_\theta : \mathbb{R}^d \to \{-1, 1\}$ such that $f_\theta(x) = 1$ when $x \in S$ and $f_\theta(x) = -1$ when $x \notin S$. The anomaly detector is parameterized by some parameters $\theta$.

Deep Robust One-Class Classification: We now present our approach to unsupervised anomaly detection, which we call Deep Robust One-Class Classification (DROCC). Our approach is based on the following hypothesis: the set of typical points $S$ lies on a low-dimensional, locally linear manifold that is well-sampled. In other words, outside a small radius around a training (typical) point, most points are anomalous. Furthermore, as manifolds are locally Euclidean, we can use the standard $\ell_2$ distance function to compare points in a small neighborhood. Figure 1a shows a 1-d manifold of the typical points and illustrates, intuitively, why we can use $\ell_2$ distances in a small neighborhood of the training points. We label the typical points as positive and anomalous points as negative.

Formally, for a DNN architecture $f_\theta : \mathbb{R}^d \to \mathbb{R}$ parameterized by $\theta$, and a small radius $r > 0$, DROCC estimates parameters $\theta$ as $\min_\theta \ell(\theta)$, where (analogously to the DROCC–LF objective (2) of Section 4, but with the standard Euclidean distance and all training points labeled positive)

$$\ell(\theta) = \lambda\|\theta\|^2 + \sum_{i=1}^{n}\Big[\ell(f_\theta(x_i), 1) + \mu \max_{\tilde{x}_i \in N_i(r)} \ell(f_\theta(\tilde{x}_i), -1)\Big], \quad N_i(r) := \{\tilde{x}_i \ \text{s.t.}\ r \le \|\tilde{x}_i - x_i\|_2 \le \gamma \cdot r\}. \qquad (1)$$
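To make the training procedure concrete, here is a minimal PyTorch-style sketch of the gradient-ascent phase that generates the adversarial negatives $\tilde{x}_i$ in (1). This is our illustration under stated assumptions, not the reference EdgeML implementation: it assumes tabular inputs of shape (batch, d), a scalar-logit network `model`, and the simplified shell N_i(r) = {h : r ≤ ‖h‖₂ ≤ γ·r}, onto which projection is available in closed form by rescaling the norm.

```python
import torch
import torch.nn.functional as F

def drocc_generate_negatives(model, x, r, gamma=2.0, ascent_step=0.1, steps=5):
    """Gradient-ascent phase of DROCC (sketch): starting from a random
    perturbation h of each normal point x, maximize the loss of labeling
    x + h as anomalous, and project h back onto the shell
    N_i(r) = {h : r <= ||h||_2 <= gamma * r} after every step."""
    h = torch.randn_like(x)
    for _ in range(steps):
        h = h.detach().requires_grad_(True)
        # loss of classifying x + h as anomalous (label 0 = anomalous here)
        loss = F.binary_cross_entropy_with_logits(
            model(x + h).squeeze(-1),
            torch.zeros(x.shape[0], device=x.device))
        grad = torch.autograd.grad(loss, h)[0]
        with torch.no_grad():
            # normalized gradient-ascent step, as in Algorithm 2 (Appendix A)
            h = h + ascent_step * grad / (grad.norm(dim=1, keepdim=True) + 1e-12)
            # closed-form projection: rescale ||h||_2 into [r, gamma * r]
            norm = h.norm(dim=1, keepdim=True)
            h = h * norm.clamp(min=r, max=gamma * r) / (norm + 1e-12)
    return (x + h).detach()
```

The descent phase then takes a standard optimizer step on $\lambda\|\theta\|^2 + \sum_i [\ell(f_\theta(x_i), 1) + \mu\, \ell(f_\theta(\tilde{x}_i), -1)]$ with the generated points $\tilde{x}_i$ held fixed.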
Figure 1. (a) A normal data manifold with red dots representing generated anomalous points in Ni(r). (b) Decision boundary learned by DROCC when applied to the data from (a). Blue represents points classified as normal and red points are classified as abnormal. (c), (d): first two dimensions of the decision boundary of DROCC and DROCC–LF, when applied to noisy data (Section 5.2). DROCC–LF is nearly optimal while DROCC's decision boundary is inaccurate. The yellow sine wave depicts the training data.
4. One-Class Classification with Limited Negatives (OCLN)

Consider again wake-word detection, where a device listens for a special audio command to wake up the system. Here, the data for a special wake word can be collected exhaustively, but the negative class, which is "everything else", cannot be sampled properly.

Naturally, we can directly apply standard AD methods (e.g., DROCC) or binary classification methods to the problem. However, AD methods ignore the side-information, while the classification methods might generalize only to the training distribution of negatives and hence might have a high False Positive Rate (FPR) in the presence of negatives far from the train distribution. Instead, we propose the method DROCC–OE, which uses an approach similar to outlier exposure (Hendrycks et al., 2019a), where the optimization function is given by a summation of the anomaly detection loss and a standard cross entropy loss over negatives. The intuition behind DROCC–OE is that the positive data satisfies the manifold hypothesis of the previous section, and hence points off the manifold should be classified as negatives; the process can be bootstrapped by the explicitly provided negatives.

Next, we propose DROCC–LF, which integrates information from the negatives in a deeper manner than DROCC–OE. In particular, we use negatives to learn input coordinates or features which are noisy and should be ignored. As DROCC uses Euclidean distance to compare points locally, it might struggle due to the noisy coordinates, which DROCC–LF will be able to ignore. Formally, DROCC–LF estimates parameters $\theta^{\mathrm{lf}}$ as $\min_\theta \ell^{\mathrm{lf}}(\theta)$, where

$$\ell^{\mathrm{lf}}(\theta) = \lambda\|\theta\|^2 + \sum_{i=1}^{n}\Big[\ell(f_\theta(x_i), y_i) + \mu \max_{\tilde{x}_i \in N_i(r)} \ell(f_\theta(\tilde{x}_i), -1)\Big], \quad N_i(r) := \{\tilde{x}_i \ \text{s.t.}\ r \le \|\tilde{x}_i - x_i\|_\Sigma \le \gamma \cdot r\}, \qquad (2)$$

and $\lambda > 0$, $\mu > 0$ are regularization parameters. Instead of the Euclidean distance, we use the Mahalanobis distance function $\|\tilde{x} - x\|_\Sigma^2 = \sum_j \sigma_j (\tilde{x}_j - x_j)^2$, where $x_j, \tilde{x}_j$ are the $j$-th coordinates of $x$ and $\tilde{x}$, respectively, and $\sigma_j := |\partial f_\theta(x)/\partial x_j|$, i.e., $\sigma_j$ measures the "influence" of the $j$-th coordinate on the output, and is updated every epoch during training.

Similar to (1), we can use the standard projected gradient descent-ascent algorithm to optimize the above saddle point problem. Here again, projection onto $N_i(r)$ is the key step. That is, the goal is to find $\tilde{x}_i = \arg\min_{\tilde{x}} \|\tilde{x} - z\|^2$ s.t. $\tilde{x} \in N_i(r)$. Unlike in Section 3 and Algorithm 1, this projection is unlikely to be available in closed form and requires more careful arguments.

Proposition 1. Consider the problem $\min_{\tilde{x}} \|\tilde{x} - z\|^2$, s.t. $r^2 \le \|\tilde{x} - x\|_\Sigma^2 \le \gamma^2 r^2$, and let $\delta = z - x$. If $r \le \|\delta\|_\Sigma \le \gamma r$, then $\tilde{x} = z$. Otherwise, the optimal solution is $\tilde{x} = x + (I + \tau\Sigma)^{-1}\delta$, where:

1) if $\|\delta\|_\Sigma \le r$,
$$\tau := \arg\min_{\tau \le 0} \sum_j \frac{\delta_j^2 \tau^2 \sigma_j^2}{(1+\tau\sigma_j)^2}, \quad \text{s.t.} \quad \sum_j \frac{\delta_j^2 \sigma_j}{(1+\tau\sigma_j)^2} \ge r^2,$$

2) if $\|\delta\|_\Sigma \ge \gamma \cdot r$,
$$\tau^{-1} := \arg\min_{\nu \ge 0} \sum_j \frac{\delta_j^2 \sigma_j^2}{(\nu+\sigma_j)^2}, \quad \text{s.t.} \quad \sum_j \frac{\delta_j^2 \sigma_j \nu^2}{(\nu+\sigma_j)^2} \le \gamma^2 r^2.$$

See Appendix A for a detailed proof. The above proposition reduces the projection problem to a non-convex but one-dimensional optimization problem. We solve this problem via a standard grid search over $\tau \in [-\frac{1}{\max_j \sigma_j}, 0]$ or $\nu \in [0, \frac{\alpha}{1-\alpha}\max_j \sigma_j]$, where $\alpha = \gamma \cdot r / \|\delta\|_\Sigma$. The algorithm is now almost the same as Algorithm 1 but uses the above-mentioned projection; see Appendix A for pseudo-code of our DROCC–LF method.
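The one-dimensional grid search prescribed by Proposition 1 is easy to implement. The following NumPy sketch is our reimplementation under stated assumptions (a diagonal Σ with strictly positive entries σ_j supplied by the caller, and a feasible grid point assumed to exist); it is not the EdgeML code:

```python
import numpy as np

def project_onto_shell(z, x, sigma, r, gamma, grid=1000):
    """Project z onto {x~ : r^2 <= sum_j sigma_j*(x~_j - x_j)^2 <= gamma^2*r^2}
    via the 1-D grid search of Proposition 1 (sketch)."""
    delta = z - x
    d_norm = np.sqrt(np.sum(sigma * delta ** 2))        # ||delta||_Sigma
    if r <= d_norm <= gamma * r:
        return z                                        # z is already feasible
    if d_norm < r:
        # Case 1: push z out to the inner sphere; tau in (-1/max_j sigma_j, 0].
        taus = np.linspace(-1.0 / sigma.max(), 0.0, grid)[1:]
        dist = np.array([np.sum(sigma * delta**2 / (1 + t * sigma)**2) for t in taus])
        objs = np.array([np.sum((t * sigma * delta)**2 / (1 + t * sigma)**2) for t in taus])
        objs[dist < r**2] = np.inf                      # enforce the constraint
        t = taus[np.argmin(objs)]
        return x + delta / (1 + t * sigma)
    # Case 2: pull z in to the outer sphere; nu = 1/tau in [0, alpha/(1-alpha)*max_j sigma_j].
    alpha = gamma * r / d_norm                          # alpha < 1 in this case
    nus = np.linspace(0.0, alpha / (1 - alpha) * sigma.max(), grid)
    dist = np.array([np.sum(sigma * (n * delta)**2 / (n + sigma)**2) for n in nus])
    objs = np.array([np.sum((sigma * delta)**2 / (n + sigma)**2) for n in nus])
    objs[dist > (gamma * r)**2] = np.inf
    n = nus[np.argmin(objs)]
    return x + n * delta / (n + sigma)
```

In the DROCC–LF ascent step, z = x + h is the current perturbed point, and the weights σ_j are re-estimated from |∂f_θ(x)/∂x_j| once per epoch.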
4.1. OCLN Evaluation Setup

Due to the lack of benchmarks, it is difficult to evaluate a solution for OCLN. So, we provide a novel experimental setup for a wake-word detection and a digit classification problem, showing that DROCC–LF indeed significantly outperforms standard anomaly detection, binary classification, and DROCC–OE on practically relevant metrics (Section 5.2).

In particular, our setup is inspired by standard settings encountered by real-world detection problems. For example, consider the wake-word detection problem, where the goal is to detect a wake-word, say "Marvin", in a continuous stream of data. In this setting, we are provided a few positive examples for Marvin and a few generic negative examples from everyday speech. But, in our experimental setup, we generate close or difficult negatives, such as Arvin, Marvelous, etc. Now, in most real-world deployments, a critical requirement is a low False Positive Rate, even on such difficult negatives. So, we study the various methods with FPR bounded by, say, 3% or 5% on negative data that comprises generic negatives as well as difficult close negatives. Then, under the FPR constraint, we evaluate the various methods by their recall rate, i.e., by how many true positives the method is able to identify. We propose a similar setup for a digit classification problem as well; see Section 5.2 for more details.

5. Empirical Evaluation

In this section, we present an empirical evaluation of DROCC on two one-class classification problems: Anomaly Detection and One-Class Classification with Limited Negatives (OCLN). We discuss the experimental setup, datasets, baselines, and the implementation details. Through experimental results on a wide range of synthetic and real-world datasets, we present strong empirical evidence for the effectiveness of our approach on one-class classification problems.

5.1. Anomaly Detection

Datasets: In all the experiments with multi-class datasets, we follow the standard one-vs-all setting for anomaly detection: fixing each class once as nominal and treating the rest as anomalies. The model is trained only on the nominal class, but the test data is sampled from all the classes. For time-series datasets, N represents the number of time-steps/frames and d represents the input feature length.

We perform experiments on the following datasets:

• 2-D sine-wave: 1000 points sampled uniformly from a 2-dimensional sine wave (see Figure 1a).
• Abalone (Dua & Graff, 2017): Physical measurements of abalone are provided, and the task is to predict the age. Classes 3 and 21 are anomalies and classes 8, 9, and 10 are normal (Das et al., 2018).
• Arrhythmia (Rayana, 2016): Features derived from ECG, where the task is to identify arrhythmic samples. Dimensionality is 279, but five categorical attributes are discarded. Dataset preparation is similar to Zong et al. (2018).
• Thyroid (Rayana, 2016): Determine whether a patient referred to the clinic is hypothyroid based on the patient's medical data. Only 6 continuous attributes are considered. Dataset preparation is the same as Zong et al. (2018).
• Epileptic Seizure Recognition (Andrzejak et al., 2001): EEG based time-series dataset from multiple patients. The task is to identify whether an EEG is abnormal (N = 178, d = 1).
• Audio Commands (Warden, 2018): A multiclass dataset with 35 classes of audio keywords. Data is featurized using MFCC features with 32 filter banks over 25ms-length windows with a stride of 10ms (N = 98, d = 32). Dataset preparation is the same as Kusupati et al. (2018).
• CIFAR-10 (Krizhevsky, 2009): Widely used benchmark for anomaly detection, with 10 classes of 32 × 32 images.
• ImageNet-10: A subset of 10 randomly chosen classes from the ImageNet dataset (Deng et al., 2009), which contains 224 × 224 color images.

The datasets we use are all publicly available. We use the train-test splits when already available, with an 80-20 split for the train and validation sets. In all other cases, we use a random 60-20-20 split for train, validation, and test.

DROCC Implementation: The main hyper-parameter of our algorithm is the radius r, which defines the set Ni(r). We observe that tweaking the radius value around √d/2 (where d is the dimension of the input data) works best, since, due to zero-mean, unit-variance normalized features, the average distance between random points is ≈ √d. We fix γ as 2 in our experiments unless specified otherwise. Parameter µ in (1) is chosen from {0.5, 1.0}. We use a standard step size from {0.1, 0.01} for gradient ascent and from {10⁻², 10⁻⁴} for gradient descent; we also tune the optimizer ∈ {Adam, SGD}. See Appendix D for a detailed ablation study. The implementation is available as part of the EdgeML package (Dennis et al.). The experiments were run on an Intel Xeon CPU with 12 cores clocked at 2.60 GHz and with an NVIDIA Tesla P40 GPU, CUDA 10.2, and cuDNN 7.6.

5.1.1. Results

Synthetic Data: We present results on a simple 2-D sine wave dataset to visualize the kind of classifiers learnt by DROCC. Here, the positive data lies on the 1-D manifold given in Figure 1a. We observe from Figure 1b that DROCC is able to capture the manifold accurately, whereas the classical methods OC-SVM and DeepSVDD (shown in Appendix B) perform poorly, as they both try to learn a minimum enclosing ball for the whole set of positive data points.

Tabular Data: Table 2 compares DROCC against the classical algorithms OC-SVM and LOF (Breunig et al., 2000), as well as the deep baselines DCN (Caron et al., 2018), Autoencoder (Zong et al., 2018), DAGMM (Zong et al., 2018), DeepSVDD, and GOAD (Bergman & Hoshen, 2020), on the widely used benchmark datasets Arrhythmia, Thyroid, and Abalone. In line with prior work, we use the F1-score for comparing the methods (Bergman & Hoshen, 2020; Zong et al., 2018). A fully-connected network with a single hidden layer is used as the base network for all the deep methods.
Table 1. Average AUC (with standard deviation) for one-vs-all anomaly detection on CIFAR-10. DROCC outperforms baselines on most classes, with gains as high as 20%; notably, nearest neighbours beats all the baselines on 2 classes.

| CIFAR Class | OC-SVM | IF | DCAE | AnoGAN | ConAD 16 | Soft-Bound Deep SVDD | One-Class Deep SVDD | Nearest Neighbour | DROCC (Ours) |
|---|---|---|---|---|---|---|---|---|---|
| Airplane | 61.6±0.9 | 60.1±0.7 | 59.1±5.1 | 67.1±2.5 | 77.2 | 61.7±4.2 | 61.7±4.1 | 69.02 | 81.66±0.22 |
| Automobile | 63.8±0.6 | 50.8±0.6 | 57.4±2.9 | 54.7±3.4 | 63.1 | 64.8±1.4 | 65.9±2.1 | 44.2 | 76.738±0.99 |
| Bird | 50.0±0.5 | 49.2±0.4 | 48.9±2.4 | 52.9±3.0 | 63.1 | 49.5±1.4 | 50.8±0.8 | 68.27 | 66.664±0.96 |
| Cat | 55.9±1.3 | 55.1±0.4 | 58.4±1.2 | 54.5±1.9 | 61.5 | 56.0±1.1 | 59.1±1.4 | 51.32 | 67.132±1.51 |
| Deer | 66.0±0.7 | 49.8±0.4 | 54.0±1.3 | 65.1±3.2 | 63.3 | 59.1±1.1 | 60.9±1.1 | 76.71 | 73.624±2.00 |
| Dog | 62.4±0.8 | 58.5±0.4 | 62.2±1.8 | 60.3±2.6 | 58.8 | 62.1±2.4 | 65.7±2.5 | 49.97 | 74.434±1.95 |
| Frog | 74.7±0.3 | 42.9±0.6 | 51.2±5.2 | 58.5±1.4 | 69.1 | 67.8±2.4 | 67.7±2.6 | 72.44 | 74.426±0.92 |
| Horse | 62.6±0.6 | 55.1±0.7 | 58.6±2.9 | 62.5±0.8 | 64.0 | 65.2±1.0 | 67.3±0.9 | 51.13 | 71.39±0.22 |
| Ship | 74.9±0.4 | 74.2±0.6 | 76.8±1.4 | 75.8±4.1 | 75.5 | 75.6±1.7 | 75.9±1.2 | 69.09 | 80.016±1.69 |
| Truck | 75.9±0.3 | 58.9±0.7 | 67.3±3.0 | 66.5±2.8 | 63.7 | 71.0±1.1 | 73.1±1.2 | 43.33 | 76.21±0.67 |
Figure 3. OCLN on Audio Commands: comparison of recall for the keywords Marvin and Seven when the False Positive Rate (FPR) is fixed to be 3% and 5%. DROCC–LF is consistently about 10% more accurate than all the baselines.

Models are trained on the above-mentioned dataset. However, for evaluation, we sample positives from the train distribution, but the negatives contain a few challenging OOD points as well. Sampling challenging negatives is itself a hard task and is the key motivating reason for studying the problem. So, we manually list keywords close to Marvin, such as Mar, Vin, Marvelous, etc. We then generate audio snippets for these keywords via a speech synthesis tool² with a variety of accents.

²[Link]services/cognitive-services/text-to-speech/

Figure 3 shows that for the 3% and 5% FPR settings, DROCC–LF is significantly more accurate than the baselines. For example, with FPR = 3%, DROCC–LF is 10% more accurate than the baselines. We repeated the same experiment with the keyword Seven and observed a similar trend. See Table 9 in the Appendix for the list of the close negatives which were synthesized for each of the keywords. In summary, DROCC–LF is able to generalize well against negatives that are "close" to the true positives, even when such negatives were not supplied with the training data.
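The metric reported in Figure 3 can be computed directly from detector scores; a minimal sketch (our illustration, not the paper's evaluation code):

```python
import numpy as np

def recall_at_fpr(pos_scores, neg_scores, fpr=0.03):
    """OCLN metric (sketch): pick the detection threshold so that at most an
    `fpr` fraction of negatives (generic plus close negatives) fire, then
    report the fraction of true positives recovered at that threshold."""
    threshold = np.quantile(neg_scores, 1.0 - fpr)
    return float(np.mean(pos_scores > threshold))
```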
6. Conclusions

We introduced the DROCC method for deep anomaly detection. It models normal data points using a low-dimensional manifold, and hence can compare close points via Euclidean distance. Based on this intuition, DROCC's optimization is formulated as a saddle point problem, which is solved via a standard gradient descent-ascent algorithm. We then extended DROCC to the OCLN problem, where the goal is to generalize well against arbitrary negatives, assuming the positive class is well-sampled and a small number of negative points are also available. Both methods perform significantly better than strong baselines in their respective problem settings. For computational efficiency, we simplified the projection set for both methods, which can perhaps slow down the convergence of the two methods. Designing optimization algorithms that can work with the original, unsimplified projection sets is an interesting direction for future work.

Acknowledgments

We are grateful to Aditya Kusupati, Nagarajan Natarajan, Sahil Bhatia and Oindrila Saha for helpful discussions and feedback. AR was funded by an Open Philanthropy AI Fellowship and a Google PhD Fellowship in Machine Learning.

References

Aggarwal, C. C. Outlier Analysis. Springer Publishing Company, Incorporated, 2nd edition, 2016. ISBN 3319475770.

Andrzejak, R. G., Lehnertz, K., Mormann, F., Rieke, C., David, P., and Elger, C. E. Indications of nonlinear deterministic and finite-dimensional structures in time series of brain electrical activity: Dependence on recording region and brain state. Physical Review E, 64(6), 2001.

Bergman, L. and Hoshen, Y. Classification-based anomaly detection for general data. In International Conference on Learning Representations (ICLR), 2020.

Boyd, S. and Vandenberghe, L. Convex Optimization. Cambridge University Press, USA, 2004. ISBN 0521833787.

Breunig, M. M., Kriegel, H.-P., Ng, R. T., and Sander, J. LOF: Identifying density-based local outliers. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pp. 93–104, 2000.

Caron, M., Bojanowski, P., Joulin, A., and Douze, M. Deep clustering for unsupervised learning of visual features. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 132–149, 2018.

Chandola, V., Banerjee, A., and Kumar, V. Anomaly detection: A survey. ACM Computing Surveys (CSUR), 41(3), 2009.

Das, S., Islam, M. R., Jayakodi, N. K., and Doppa, J. R. Active anomaly detection via ensembles, 2018.

Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.

Dennis, D. K., Gaurkar, Y., Gopinath, S., Goyal, S., Gupta, C., Jain, M., Kumar, A., Kusupati, A., Lovett, C., Patil, S. G., Saha, O., and Simhadri, H. V. EdgeML: Machine
Learning for resource-constrained edge devices. URL [Link]

Dua, D. and Graff, C. UCI machine learning repository, 2017. URL [Link]

Golan, I. and El-Yaniv, R. Deep anomaly detection using geometric transformations. In Advances in Neural Information Processing Systems (NeurIPS), 2018.

Goldstein, M. and Uchida, S. A comparative evaluation of unsupervised anomaly detection algorithms for multivariate data. PLOS ONE, 11(4), 2016.

Gu, X., Akoglu, L., and Rinaldo, A. Statistical analysis of nearest neighbor methods for anomaly detection. In Advances in Neural Information Processing Systems (NeurIPS), 2019.

Hendrycks, D., Mazeika, M., and Dietterich, T. Deep anomaly detection with outlier exposure. In International Conference on Learning Representations (ICLR), 2019a.

Hendrycks, D., Mazeika, M., Kadavath, S., and Song, D. Using self-supervised learning can improve model robustness and uncertainty. In Advances in Neural Information Processing Systems (NeurIPS), 2019b.

Krizhevsky, A. Learning multiple layers of features from tiny images, 2009.

Kusupati, A., Singh, M., Bhatia, K., Kumar, A., Jain, P., and Varma, M. FastGRNN: A fast, accurate, stable and tiny kilobyte sized gated recurrent neural network. In Advances in Neural Information Processing Systems, pp. 9017–9028, 2018.

Lakhina, A., Crovella, M., and Diot, C. Diagnosing network-wide traffic anomalies. SIGCOMM Comput. Commun. Rev., 34(4), 2004.

LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 1998.

Li, D., Chen, D., Goh, J., and Ng, S.-K. Anomaly detection with generative adversarial networks for multivariate time series, 2018.

Liu, F. T., Ting, K. M., and Zhou, Z.-H. Isolation forest. In Proceedings of the 2008 Eighth IEEE International Conference on Data Mining, 2008.

Madry, A., Makelov, A., Schmidt, L., Tsipras, D., and Vladu, A. Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations (ICLR), 2018.

Malhotra, P., Ramakrishnan, A., Anand, G., Vig, L., Agarwal, P., and Shroff, G. LSTM-based encoder-decoder for multi-sensor anomaly detection, 2016. URL [Link]/abs/1607.00148.

Nguyen, D. T., Lou, Z., Klar, M., and Brox, T. Anomaly detection with multiple-hypotheses predictions. In International Conference on Machine Learning (ICML), 2019.

Pless, R. and Souvenir, R. A survey of manifold learning for images. IPSJ Transactions on Computer Vision and Applications, 1, 2009.

Rayana, S. ODDS library, 2016. URL [Link]

Ruff, L., Vandermeulen, R., Goernitz, N., Deecke, L., Siddiqui, S. A., Binder, A., Müller, E., and Kloft, M. Deep one-class classification. In International Conference on Machine Learning (ICML), 2018.

Ruff, L., Vandermeulen, R. A., Görnitz, N., Binder, A., Müller, E., Müller, K.-R., and Kloft, M. Deep semi-supervised anomaly detection. In International Conference on Learning Representations (ICLR), 2020.

Sakurada, M. and Yairi, T. Anomaly detection using autoencoders with nonlinear dimensionality reduction. In Proceedings of the MLSDA 2014 2nd Workshop on Machine Learning for Sensory Data Analysis, 2014.

Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., and Chen, L.-C. MobileNetV2: Inverted residuals and linear bottlenecks, 2018a. URL [Link]/abs/1801.04381.

Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., and Chen, L.-C. MobileNetV2: Inverted residuals and linear bottlenecks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018b.

Schlegl, T., Seeböck, P., Waldstein, S. M., Schmidt-Erfurth, U., and Langs, G. Unsupervised anomaly detection with generative adversarial networks to guide marker discovery. In International Conference on Information Processing in Medical Imaging, pp. 146–157. Springer, 2017a.

Schlegl, T., Seeböck, P., Waldstein, S. M., Schmidt-Erfurth, U., and Langs, G. Unsupervised anomaly detection with generative adversarial networks to guide marker discovery. Information Processing in Medical Imaging, 2017b.

Schölkopf, B., Williamson, R., Smola, A., Shawe-Taylor, J., and Platt, J. Support vector method for novelty detection. In Proceedings of the 12th International Conference on Neural Information Processing Systems, 1999.
A. OCLN

Algorithm 2: Training neural networks via DROCC–LF

Input: Training data D = [(x₁, y₁), (x₂, y₂), ..., (xₙ, yₙ)].
Parameters: Radius r, λ ≥ 0, µ ≥ 0, step-size η, number of gradient steps m, number of initial training steps n₀.
Initial steps: For B = 1, ..., n₀:
  X_B: batch of training inputs
  θ = θ − Gradient-Step( Σ_{(x,y) ∈ X_B} ℓ(f_θ(x), y) )
DROCC steps: For B = n₀, ..., n₀ + N:
  X_B: batch of normal training inputs (y = 1)
  ∀x ∈ X_B: h ∼ N(0, I_d)
  Adversarial search: For i = 1, ..., m:
    1. ℓ(h) = ℓ(f_θ(x + h), −1)
    2. h = h + η ∇_h ℓ(h) / ‖∇_h ℓ(h)‖
    3. h = projection given by Proposition 1 (with δ = h)
  ℓ_itr = λ‖θ‖² + Σ_{(x,y) ∈ X_B} [ℓ(f_θ(x), y) + µ ℓ(f_θ(x + h), −1)]
  θ = θ − Gradient-Step(ℓ_itr)

A.1. DROCC–LF Proof

Proof of Proposition 1. Recall the problem:

$$\min_{\tilde{x}} \|\tilde{x} - z\|^2, \quad \text{s.t.} \quad r^2 \le \|\tilde{x} - x\|_\Sigma^2 \le \gamma^2 r^2.$$

Note that both constraints cannot be active at the same time, so we can consider either the constraint $r^2 \le \|\tilde{x} - x\|_\Sigma^2$ or the constraint $\|\tilde{x} - x\|_\Sigma^2 \le \gamma^2 r^2$. Below, we give the calculation when the former constraint is active; the latter's proof follows along the same lines.

Let $\tau \le 0$ be the Lagrange multiplier; then the Lagrangian function of the above problem is given by:

$$L(\tilde{x}, \tau) = \|\tilde{x} - z\|^2 + \tau(\|\tilde{x} - x\|_\Sigma^2 - r^2).$$

Using the KKT first-order necessary condition (Boyd & Vandenberghe, 2004), the following should hold for any optimal solution $\tilde{x}, \tau$:

$$\nabla_{\tilde{x}} L(\tilde{x}, \tau) = 0.$$

That is,

$$\tilde{x} = (I + \tau\Sigma)^{-1}(z + \tau \cdot \Sigma x) = x + (I + \tau \cdot \Sigma)^{-1}\delta,$$

where $\delta = z - x$. This proves the first part of the proposition.
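The stationarity condition above is easy to verify numerically; the following is our sanity-check sketch for a diagonal Σ represented by its entries σ_j:

```python
import numpy as np

# For x_tilde = x + (I + tau*Sigma)^{-1} delta, the gradient of the Lagrangian
# L(x_tilde, tau) = ||x_tilde - z||^2 + tau*(||x_tilde - x||_Sigma^2 - r^2)
# with respect to x_tilde should vanish.
rng = np.random.default_rng(0)
d = 5
x, z = rng.normal(size=d), rng.normal(size=d)
sigma = rng.uniform(0.5, 2.0, size=d)   # diagonal entries of Sigma
tau = -0.3 / sigma.max()                # any tau with 1 + tau*sigma_j > 0
delta = z - x
x_tilde = x + delta / (1.0 + tau * sigma)
grad = 2 * (x_tilde - z) + 2 * tau * sigma * (x_tilde - x)
assert np.allclose(grad, 0.0)
```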
Table 6. Average AUC for the spherical manifold experiment (Section B.2). Normal points are sampled uniformly from the volume of a unit sphere and OOD points are sampled from the surface of a sphere of varying radius (see Figure 4b). Again, DROCC outperforms all the baselines when the OOD points are quite close to the normal distribution.

| Radius | Nearest Neighbor | OC-SVM | AutoEncoder | DeepSVDD | DROCC (Ours) |
|---|---|---|---|---|---|
| 1.2 | 100 ± 0.00 | 92.00 ± 0.00 | 91.81 ± 2.12 | 93.26 ± 0.91 | 99.44 ± 0.10 |
| 1.4 | 100 ± 0.00 | 92.97 ± 0.00 | 97.85 ± 1.41 | 98.81 ± 0.34 | 99.99 ± 0.00 |
| 1.6 | 100 ± 0.00 | 92.97 ± 0.00 | 99.92 ± 0.11 | 99.99 ± 0.00 | 100.00 ± 0.00 |
| 1.8 | 100 ± 0.00 | 91.87 ± 0.00 | 99.98 ± 0.04 | 100.00 ± 0.00 | 100.00 ± 0.00 |
| 2.0 | 100 ± 0.00 | 91.83 ± 0.00 | 100 ± 0.00 | 100.00 ± 0.00 | 100.00 ± 0.00 |
Table 7. Average AUC for the synthetic 1-D sine wave manifold experiment (Section B.1). Normal points are sampled from a sine wave and OOD points from a vertically displaced manifold (see Figure 6). The results demonstrate that only DROCC is able to capture the manifold tightly.

| Vertical Displacement | Nearest Neighbor | OC-SVM | AutoEncoder | DeepSVDD | DROCC (Ours) |
|---|---|---|---|---|---|
| 0.2 | 100 ± 0.00 | 56.99 ± 0.00 | 52.48 ± 1.15 | 65.91 ± 0.64 | 96.80 ± 0.65 |
| 0.4 | 100 ± 0.00 | 68.84 ± 0.00 | 58.59 ± 0.61 | 78.18 ± 1.67 | 99.31 ± 0.80 |
| 0.6 | 100 ± 0.00 | 76.95 ± 0.00 | 66.59 ± 1.21 | 82.85 ± 1.96 | 99.92 ± 0.11 |
| 0.8 | 100 ± 0.00 | 81.73 ± 0.00 | 77.42 ± 3.62 | 86.26 ± 1.69 | 99.98 ± 0.01 |
| 1.0 | 100 ± 0.00 | 88.18 ± 0.00 | 86.14 ± 2.52 | 90.51 ± 2.62 | 100 ± 0.00 |
| 2.0 | 100 ± 0.00 | 98.56 ± 0.00 | 100 ± 0.00 | 100 ± 0.00 | 100 ± 0.00 |
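The synthetic data behind Table 6 is straightforward to regenerate; here is our sketch of the spherical-manifold sampler (the exact generation script is not included in this excerpt):

```python
import numpy as np

def sample_sphere_experiment(n, d, ood_radius, rng=None):
    """Spherical-manifold data (sketch): normal points uniform in the unit
    ball, OOD points on the surface of a sphere of radius `ood_radius`."""
    rng = rng or np.random.default_rng(0)
    def directions(k):
        g = rng.normal(size=(k, d))
        return g / np.linalg.norm(g, axis=1, keepdims=True)
    # radius ~ U^(1/d) makes points uniform over the unit ball's volume
    normal = directions(n) * rng.uniform(0.0, 1.0, size=(n, 1)) ** (1.0 / d)
    ood = directions(n) * ood_radius
    return normal, ood
```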
Note that at radius 1.0 the training doesn't converge, since there is no separating boundary. Assuming neural networks are implicitly regularized to find the simplest boundary, DROCC with neural networks also learns essentially a minimum enclosing ball in this case, however at a slightly larger radius. Therefore, we get 100% AUC only at radius 1.6 rather than 1 + ε for some very small ε.

C. OCLN: Supplementary Experiments

In Section 5.2.1, we compared DROCC–LF with various baselines for the OCLN task, where the goal is to learn a classifier that is accurate for both the positive class and arbitrary OOD negatives. Figure 9 compares the recall obtained by the different methods on the two keywords "Forward" and "Follow" at two different FPRs. Table 9 lists the close negatives which were synthesized for each of the keywords.
Figure 7. Ablation study: variation in the performance of DROCC when r (with γ = 1) is changed from the optimal value.

Figure 8. Ablation study: variation in the performance of DROCC with µ in (1), the weightage given to the loss from adversarially sampled negative points.
Table 9. Synthesized near-negatives for keywords in Audio Commands.

| Marvin | Forward | Seven | Follow |
|---|---|---|---|
| mar | for | one | fall |
| marlin | fervor | eleven | fellow |
| arvin | ward | heaven | low |
| marvik | reward | when | hollow |
| arvi | onward | devon | wallow |

Table 11. Hyperparameters: CIFAR-10.

| Class | Radius | µ | Optimizer | Learning Rate | Adversarial Ascent Step Size |
|---|---|---|---|---|---|
| Airplane | 8 | 1 | Adam | 0.001 | 0.001 |
| Automobile | 8 | 0.5 | SGD | 0.001 | 0.001 |
| Bird | 40 | 0.5 | Adam | 0.001 | 0.001 |
| Cat | 28 | 1 | SGD | 0.001 | 0.001 |
| Deer | 32 | 1 | SGD | 0.001 | 0.001 |
| Dog | 24 | 0.5 | SGD | 0.01 | 0.001 |
| Frog | 36 | 1 | SGD | 0.001 | 0.01 |
| Horse | 32 | 0.5 | SGD | 0.001 | 0.001 |
| Ship | 28 | 0.5 | SGD | 0.001 | 0.001 |
| Truck | 16 | 0.5 | SGD | 0.001 | 0.001 |
Table 10. Hyperparameters: Tabular Experiments.

| Dataset | Radius | µ | Optimizer | Learning Rate | Adversarial Ascent Step Size |
|---|---|---|---|---|---|
| Abalone | 3 | 1.0 | Adam | 10⁻³ | 0.01 |
| Arrhythmia | 16 | 1.0 | Adam | 10⁻⁴ | 0.01 |
| Thyroid | 2.5 | 1.0 | Adam | 10⁻³ | 0.01 |

D. Ablation Study

D.1. Hyper-Parameters

Here we analyze the effect of two important hyper-parameters: the radius r of the ball outside of which we sample negative points (the set Ni(r)), and µ, the weightage given to the loss from adversarially generated negative points (see Equation 1). We set γ = 1, and hence recall that Ni(r) reduces to the sphere of radius r around each training point.