
DROCC: Deep Robust One-Class Classification

The document introduces Deep Robust One-Class Classification (DROCC), a novel method for anomaly detection that addresses limitations of existing techniques by assuming that typical data points lie on a well-sampled, locally linear low-dimensional manifold. DROCC is robust to representation collapse and effectively models various data types, achieving significant accuracy improvements over state-of-the-art methods in multiple domains. The paper also discusses an extension of DROCC for one-class classification with limited negative instances, demonstrating its effectiveness across diverse datasets.


Sachin Goyal 1 Aditi Raghunathan 2 * Moksh Jain 3 * Harsha Simhadri 1 Prateek Jain 1

Abstract

Classical approaches for one-class problems such as one-class SVM and isolation forest require careful feature engineering when applied to structured domains like images. State-of-the-art methods aim to leverage deep learning to learn appropriate features via two main approaches. The first approach based on predicting transformations (Golan & El-Yaniv, 2018; Hendrycks et al., 2019a) while successful in some domains, crucially depends on an appropriate domain-specific set of transformations that are hard to obtain in general. The second approach of minimizing a classical one-class loss on the learned final layer representations, e.g., DeepSVDD (Ruff et al., 2018) suffers from the fundamental drawback of representation collapse. In this work, we propose Deep Robust One Class Classification (DROCC) that is both applicable to most standard domains without requiring any side-information and robust to representation collapse. DROCC is based on the assumption that the points from the class of interest lie on a well-sampled, locally linear low dimensional manifold. Empirical evaluation demonstrates that DROCC is highly effective in two different one-class problem settings and on a range of real-world datasets across different domains: tabular data, images (CIFAR and ImageNet), audio, and time-series, offering up to 20% increase in accuracy over the state-of-the-art in anomaly detection. Code is available at https://[Link]/microsoft/EdgeML.

* Part of the work was done while interning at Microsoft Research India. 1 Microsoft Research India 2 Stanford University 3 NITK Surathkal. Correspondence to: Prateek Jain <prajain@[Link]>.

Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s).

arXiv:2002.12718v2 [[Link]] 15 Aug 2020

1. Introduction

In this work, we study “one-class” classification where the goal is to obtain accurate discriminators for a special class. Anomaly detection is one of the most well-known problems in this setting where we want to identify outliers, i.e. points that do not belong to the typical data (special class). Another related setting under this framework is classification from limited negative training instances where we require low false positive rate at test time even over close negatives. This is common in AI systems such as wake-word¹ detection where the wake-words form the positive or special class, and for safe operation in the real world, the system should not fire on inputs that are close but not identical to the wake-words, no matter how the training data was sampled.

¹ Audio or visual cue that triggers some action from the system.

Anomaly detection is a well-studied problem with a large body of research (Aggarwal, 2016; Chandola et al., 2009). Classical approaches for anomaly detection are based on modeling the typical data using simple functions over the inputs (Schölkopf et al., 1999; Liu et al., 2008; Lakhina et al., 2004), such as constructing a minimum-enclosing ball around the typical data points (Tax & Duin, 2004). While these techniques are well-suited when the input is featurized appropriately, they struggle on complex domains like vision and speech, where hand-designing features is difficult.

In contrast, deep learning based anomaly detection methods attempt to automatically learn features, e.g., using CNNs in vision (Ruff et al., 2018). However, current approaches to do so have fundamental limitations. One family of approaches is based on extending the classical data modeling techniques over the learned representations. However, learning these representations jointly with the data modeling layer might lead to degenerate solutions where all the points are mapped to a single point (like origin), and the data modeling layer can now perfectly “fit” the typical data. Recent works like (Ruff et al., 2018) have proposed some heuristics to mitigate this like setting the bias to zero, but such heuristics are often insufficient in practice (Table 1). The second line of work (Golan & El-Yaniv, 2018; Bergman & Hoshen, 2020; Hendrycks et al., 2019b) is based on learning the salient geometric structure of the typical data (e.g., orientation of the object) by applying specific transformations (e.g., rotations and flips) to the input data and training the discriminator to predict applied transformation. If the discriminator fails to predict the transform accurately, the input does not have the same orientation as typical data and is considered

anomalous. In order to be successful, these works critically rely on side-information in the form of appropriate structure/transformations, which is difficult to define in general, especially for domains like time-series, speech, etc. Even for images, if the normal data has been captured from multiple orientations, it is difficult to find appropriate transformations. The last set of deep anomaly detection techniques use generative models such as autoencoders or generative-adversarial networks (GANs) (Schlegl et al., 2017a) to learn to generate the entire typical data distribution, which can be challenging and inaccurate in practice (Table 1).

In this paper, we propose a novel Deep Robust One-Class Classification (DROCC) method for anomaly detection that attempts to address the drawbacks of previous methods detailed above. DROCC is robust to representation collapse by involving a discriminative component that is general and is empirically accurate on most standard domains like tabular, time-series and vision without requiring any additional side-information. DROCC is motivated by the key observation that generally, the typical data lies on a low-dimensional manifold, which is well-sampled in the training data. This is believed to be true even in complex domains such as vision, speech, and natural language (Pless & Souvenir, 2009). As manifolds resemble Euclidean space locally, our discriminative component is based on classifying a point as anomalous if it is outside the union of small $\ell_2$ balls around the training typical points (see Figure 1a for an illustration). Importantly, the above definition allows us to synthetically generate anomalous points, and we adaptively generate the most effective anomalous points while training via a gradient ascent phase reminiscent of adversarial training. In other words, DROCC has a gradient ascent phase to adaptively add anomalous points to our training set and a gradient descent phase to minimize the classification loss by learning a representation and a classifier on top of the representations to separate typical points from the generated anomalous points. In this way, DROCC automatically learns an appropriate representation (like DeepSVDD) but is robust to a representation collapse, as mapping all points to the same value would lead to poor discrimination between normal points and the generated anomalous points.

Next, we study a critical problem similar in flavor to anomaly detection and outlier exposure (Hendrycks et al., 2019a), which we refer to as One-class Classification with Limited Negatives (OCLN). The goal of OCLN is to design a one-class classifier for a positive class with only limited negative instances: the space of negatives is huge and is not well-sampled by the training points. The OCLN classifier should have low FPR against an arbitrary distribution of negatives (or uninteresting class) while still ensuring accurate predictions for positives. For example, consider audio wake-word detection, where the goal is to identify a certain word, say Marvin, in a given speech stream. For training, we collect negative instances where Marvin has not been uttered. Standard classification methods tend to identify simple patterns for classification, often relying only on some substring of Marvin, say Mar. While such a classifier has good accuracy on the training set, in practice, it can have a high FPR as the classifier will mis-fire on utterances like Marvelous or Martha. This exact setting has been relatively less well-studied, and there is no benchmark to evaluate methods. Existing work suggests simply expanding the training data to include false positives found after the model is deployed, which is expensive and oftentimes infeasible or unsafe in real applications.

In contrast, we propose DROCC–LF, an outlier-exposure style extension of DROCC. Intuitively, DROCC–LF combines DROCC’s anomaly detection loss (that is over only the positive data points) with standard classification loss over the negative data. But, in addition, DROCC–LF exploits the negative examples to learn a Mahalanobis distance to compare points over the manifold instead of using the standard Euclidean distance, which can be inaccurate for high-dimensional data with relatively fewer samples.

We apply DROCC to standard benchmarks from multiple domains such as vision, audio, time-series, tabular data, and empirically observe that DROCC is indeed successful at modeling the positive (typical) class across all the above mentioned domains and can significantly outperform baselines. For example, when applied to the anomaly detection task on the benchmark CIFAR-10 dataset, our method can be up to 20% more accurate than the baselines like DeepSVDD (Ruff et al., 2018), Autoencoder (Sakurada & Yairi, 2014), and GAN based methods (Nguyen et al., 2019). Similarly, for tabular data benchmarks like Arrhythmia, DROCC can be ≥ 18% more accurate than state-of-the-art methods (Bergman & Hoshen, 2020; Zong et al., 2018). Finally, for the OCLN problem, our method can be up to 10% more accurate than standard baselines.

In summary, the paper makes the following contributions:

• We propose the DROCC method, which is based on a low-dimensional manifold assumption on the positive class, using which it synthetically and adaptively generates negative instances to provide a general and robust approach to anomaly detection.
• We extend DROCC to a one-class classification problem where low FPR on arbitrary negatives is crucial. We also provide an experimental setup to evaluate different methods for this important but relatively less studied problem.
• Finally, we experiment with DROCC on a wide range of datasets across different domains (image, audio, time-series data) and demonstrate the effectiveness of our method compared to baselines.
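The discriminative idea described in the introduction, calling a point anomalous when it falls outside the union of small $\ell_2$ balls around the typical training points, can be illustrated with a few lines of Python. This is only a sketch of the motivating definition, not the trained DROCC classifier; the function name `is_anomalous`, the toy sine-wave sample, and the radius value are assumptions made for illustration.

```python
import math


def is_anomalous(point, typical_points, radius):
    """True if `point` lies outside every l2 ball of `radius` around a typical point."""
    return all(math.dist(point, t) > radius for t in typical_points)


# Toy "typical" set: 200 samples from one period of a sine wave (illustrative only).
typical = [(t, math.sin(t)) for t in (2 * math.pi * k / 199 for k in range(200))]

# A point close to the curve and a point far away from it.
print(is_anomalous((1.0, math.sin(1.0) + 0.05), typical, radius=0.2))
print(is_anomalous((math.pi / 2, -1.0), typical, radius=0.2))
```

A brute-force scan like this costs one distance computation per training point, which is one reason a learned classifier is preferable to applying the definition directly.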

2. Related Work

Anomaly Detection (AD) has been extensively studied owing to its wide applicability (Chandola et al., 2009; Goldstein & Uchida, 2016; Aggarwal, 2016). Classical techniques use simple functions like modeling normal points via low-dimensional subspace or a tree-structured partition of the input space to detect anomalies (Schölkopf et al., 1999; Tax & Duin, 2004; Liu et al., 2008; Lakhina et al., 2004; Gu et al., 2019). In contrast, deep AD methods attempt to learn appropriate features, while also learning how to model the typical data points using these features. They broadly fall into three categories discussed below.

AD via generative modeling. Deep Autoencoders as well as GAN based methods have been studied extensively (Malhotra et al., 2016; Sakurada & Yairi, 2014; Nguyen et al., 2019; Li et al., 2018; Schlegl et al., 2017b). However, these methods solve a harder problem as they require reconstructing the entire input from its low-dimensional representation during the decoding step. In contrast, DROCC directly addresses the goal of only identifying if a given point lies somewhere on the manifold, and hence tends to be more accurate in practice (see Tables 1, 2, 3).

Deep One Class SVM: Deep SVDD (Ruff et al., 2018) introduced the first deep one-class classification objective for anomaly detection, but suffers from the representation collapse issue (see Section 1). In contrast, DROCC is robust to such collapse since the training objective requires representations to allow for accurate discrimination between typical data points and their perturbations that are off the manifold of the typical data points.

Transformations based methods: Recently, (Golan & El-Yaniv, 2018; Hendrycks et al., 2019b) proposed another approach to AD based on self-supervision. The training procedure involves applying different transformations to the typical points and training a classifier to identify the transform applied. The key assumption is that a point is normal iff the transformations applied to the point can be correctly identified, i.e., normal points conform to a specific structure captured by the transformations. (Golan & El-Yaniv, 2018; Hendrycks et al., 2019b) applied the method to vision datasets and proposed using rotations, flips, etc. as the transformations. (Bergman & Hoshen, 2020) generalized the method to tabular data by using handcrafted affine transforms. Naturally, the transformations required by these methods are heavily domain dependent and are hard to design for domains like time-series. Furthermore, even for vision tasks, the suitability of a transformation varies based on the structure of the typical points. For example, as discussed in (Golan & El-Yaniv, 2018), horizontal flips perform well when the typical points are from class '3' (AUROC 0.957) of MNIST but perform poorly when typical points are from class '8' (AUROC 0.646). In contrast, the low-dimensional manifold assumption that motivates DROCC is generic and seems to hold across several domains like images, speech, etc. For example, DROCC obtains AUROC of ∼ 0.97 on both typical class '8' and typical class '3' in MNIST. (See Section 5 for more comparison with self-supervision based techniques.)

Side-information based AD: Recently, several AD methods that explicitly incorporate side-information have been proposed. (Hendrycks et al., 2019a) leverages access to a few out-of-distribution samples; (Ruff et al., 2020) explores the semisupervised setting where a small set of labeled anomalous examples is available. We view these approaches as complementary to DROCC, which does not assume any side-information. Finally, the OCLN problem is generally modeled as a binary classification problem, but an outlier exposure (OE) style formulation (Hendrycks et al., 2019b) can be used to combine it with anomaly detection methods. Our method DROCC–LF builds upon the OE approach but exploits the “outliers” in a more integrated manner.

3. Anomaly Detection

Let $S \subseteq \mathbb{R}^d$ denote the set of typical, i.e., non-anomalous data points. A point $x \in \mathbb{R}^d$ is anomalous or atypical if $x \notin S$. Suppose we are given $n$ samples $D = [x_i]_{i=1}^{n} \in \mathbb{R}^d$ as training data, where $D_S = \{x_i \mid x_i \in S\}$ is the set of typical points sampled in the training data and $|D_S| \ge (1 - \nu)|S|$, i.e., a $\nu \ll 1$ fraction of points in $D$ are anomalies. Then, the goal in unsupervised anomaly detection (AD) is to learn a function $f_\theta : \mathbb{R}^d \mapsto \{-1, 1\}$ such that $f_\theta(x) = 1$ when $x \in S$ and $f_\theta(x) = -1$ when $x \notin S$. The anomaly detector is parameterized by some parameters $\theta$.

Deep Robust One Class Classification: We now present our approach to unsupervised anomaly detection that we call Deep Robust One Class Classification (DROCC). Our approach is based on the following hypothesis: The set of typical points $S$ lies on a low dimensional locally linear manifold that is well-sampled. In other words, outside a small radius around a training (typical) point, most points are anomalous. Furthermore, as manifolds are locally Euclidean, we can use the standard $\ell_2$ distance function to compare the points in a small neighborhood. Figure 1a shows a 1-d manifold of the typical points and, intuitively, why in a small neighborhood of the training points we can use $\ell_2$ distances. We label the typical points as positive and anomalous points as negative.

Formally, for a DNN architecture $f_\theta : \mathbb{R}^d \to \mathbb{R}$ parameterized by $\theta$, and a small radius $r > 0$, DROCC estimates

parameter $\theta^{dr}$ as: $\min_\theta \ell^{dr}(\theta)$, where,

$$\ell^{dr}(\theta) = \lambda\|\theta\|^2 + \sum_{i=1}^{n}\Big[\ell(f_\theta(x_i), 1) + \mu \max_{\tilde{x}_i \in N_i(r)} \ell(f_\theta(\tilde{x}_i), -1)\Big],$$

$$N_i(r) \stackrel{\text{def}}{=} \Big\{ \|\tilde{x}_i - x_i\|_2 \le \gamma \cdot r;\ r \le \|\tilde{x}_i - x_j\|,\ \forall j = 1, 2, \ldots, n \Big\}, \tag{1}$$

and $\lambda > 0$, $\mu > 0$ are regularization parameters. $N_i(r)$ captures points off the manifold, i.e., points that are at least at distance $r$ from all training points. We use an upper bound $\gamma \cdot r$ for regularizing the optimization problem, where $\gamma \ge 1$. $\ell : \mathbb{R} \times \mathbb{R} \to \mathbb{R}$ is some classification loss function, and the goal is to classify the given normal points $x_i$ as positives and the generated anomalous examples $\tilde{x}_i$ as negatives.

The above given formulation is a saddle point problem and is similar to adversarial training where the network is trained to be robust to worst-case $\ell_p$ ball perturbations around the inputs (see, for example, (Madry et al., 2018)). In DROCC, we replace the $\ell_p$ ball with $N_i(r)$, and adopt the standard projected gradient descent-ascent technique to solve the saddle point problem.

Gradient-ascent to generate negatives. A key step in the gradient descent-ascent algorithm is that of projection onto the $N_i(r)$ set. That is, given $z \in \mathbb{R}^d$, the goal is to find $\tilde{x}_i = \arg\min_u \|u - z\|^2$ s.t. $u \in N_i(r)$. Now, $N_i$ contains points that are less than $\gamma \cdot r$ distance away from $x_i$ and at least $r$ away from all $x_j$'s. The second constraint involves all the training points and is computationally challenging. So, for computational ease, we redefine $N_i(r) \stackrel{\text{def}}{=} \{ r \le \|\tilde{x}_i - x_i\|_2 \le \gamma \cdot r \}$. In practice, since the positive points in $S$ lie on a low dimensional manifold, we empirically find that the adversarial search over this set does not yield a point that is in $S$. Further, we use a lower weight on the classification loss of the generated negatives so as to guard against possible non-anomalous points in $N_i(r)$. Finally, projection onto this set is given by $\tilde{x}_i = x_i + \alpha \cdot (z - x_i)$ where $\beta = \|z - x_i\|$, and $\alpha = \gamma r/\beta$ if $\beta \ge \gamma r$ (point is too far), $\alpha = r/\beta$ if $\beta \le r$, and $\alpha = 1$ otherwise.

Algorithm 1 summarizes our DROCC method. The three steps in the adversarial search are performed in parallel for each $x$ in the batch; for simplicity, we present the procedure for a single example $x$. In step one, we compute the loss of the network with respect to a negative label (anomalous point) where we express $\tilde{x}$ as $x + h$. In step two, we maximize this loss in order to find the most “adversarial” point via normalized steepest ascent. Finally, we project $\tilde{x}$ onto $N_i(r)$. In order to update the parameters of the network, we could use any gradient based update rule such as SGD or other adaptive methods like Adagrad or Adam. We typically set $\gamma = 2$. Parameters $\lambda$, $\mu$, $\eta$ are selected via cross-validation. Note that our method allows an arbitrary DNN architecture $f_\theta$ to represent and classify data points $x_i$. Finally, we set $\ell$ to be the standard cross-entropy loss.

4. One-class Classification with Limited Negatives (OCLN)

In this section, we extend DROCC to address the OCLN problem. Let $D = [(x_1, y_1), \ldots, (x_n, y_n)]$ be a given set of points where $x_i \in \mathbb{R}^d$ and $y_i \in \{1, -1\}$. Furthermore, assume that the mass of the positive points' distribution covered by the training data is significantly higher than that of the negative points' distribution. For example, suppose data points are sampled from a discrete distribution, with $P_+$ being the marginal distribution of the positive points and $P_-$ the marginal distribution of the negative points. Then, the assumption is: $\frac{1}{n_-}\sum_{i,\, y_i = -1} P_-(x_i) \ll \frac{1}{n_+}\sum_{i,\, y_i = 1} P_+(x_i)$, where $n_+$, $n_-$ are the number of positive and negative training points.

The goal of OCLN is similar to anomaly detection (AD), that is, to identify arbitrary outliers (the negative class in this case) correctly despite limited access to the negatives' data distribution. So it is an AD problem with side-information in the form of limited negatives. Intuitively, OCLN problems arise in domains where data for the special positive class (or set of classes) can be collected thoroughly, but the “negative” class is a catch-all class that cannot be sampled thoroughly due to its sheer size. Several real-world problems can be naturally modeled by OCLN. For example, consider wake word detection problems where the goal is to identify a

Algorithm 1 Training neural networks via DROCC
Input: Training data $D = [x_1, x_2, \ldots, x_n]$.
Parameters: Radius $r$, $\lambda \ge 0$, $\mu \ge 0$, step-size $\eta$, number of gradient steps $m$, number of initial training steps $n_0$.
Initial steps: For $B = 1, \ldots, n_0$
    $X_B$: Batch of training inputs
    $\theta = \theta - \text{Gradient-Step}\big(\sum_{x \in X_B} \ell(f_\theta(x), 1)\big)$
DROCC steps: For $B = n_0, \ldots, n_0 + N$
    $X_B$: Batch of training inputs
    $\forall x \in X_B$: $h \sim \mathcal{N}(0, I_d)$
    Adversarial search: For $i = 1, \ldots, m$
        1. $\ell(h) = \ell(f_\theta(x + h), -1)$
        2. $h = h + \eta \frac{\nabla_h \ell(h)}{\|\nabla_h \ell(h)\|}$
        3. $h = \frac{\alpha}{\|h\|} \cdot h$ where $\alpha = r \cdot 1[\|h\| \le r] + \|h\| \cdot 1[r \le \|h\| \le \gamma \cdot r] + \gamma \cdot r \cdot 1[\|h\| \ge \gamma \cdot r]$
    $\ell^{itr} = \lambda\|\theta\|^2 + \sum_{x \in X_B} \big[\ell(f_\theta(x), 1) + \mu\, \ell(f_\theta(x + h), -1)\big]$
    $\theta = \theta - \text{Gradient-Step}(\ell^{itr})$

special audio command to wake up the system. Here, the data for a special wake word can be collected exhaustively, but the negative class, which is “everything else”, cannot be sampled properly.

Naturally, we can directly apply standard AD methods (e.g., DROCC) or binary classification methods to the problem. However, AD methods ignore the side-information, while the classification methods might generalize only to the training distribution of negatives and hence might have high False Positive Rate (FPR) in the presence of negatives far from the train distribution. Instead, we propose method DROCC–OE that uses an approach similar to outlier exposure (Hendrycks et al., 2019a), where the optimization function is given by a summation of the anomaly detection loss and standard cross entropy loss over negatives. The intuition behind DROCC–OE is that the positive data satisfies the manifold hypothesis of the previous section, and hence points off the manifold should be classified as negatives. But the process can be bootstrapped by explicit negatives from the training data.

Next, we propose DROCC–LF, which integrates information from the negatives in a deeper manner than DROCC–OE. In particular, we use negatives to learn input coordinates or features which are noisy and should be ignored. As DROCC uses Euclidean distance to compare points locally, it might struggle due to the noisy coordinates, which DROCC–LF will be able to ignore. Formally, DROCC–LF estimates parameter $\theta^{lf}$ as: $\min_\theta \ell^{lf}(\theta)$ where,

$$\ell^{lf}(\theta) = \lambda\|\theta\|^2 + \sum_{i=1}^{n}\Big[\ell(f_\theta(x_i), y_i) + \mu \max_{\tilde{x}_i \in N_i(r)} \ell(f_\theta(\tilde{x}_i), -1)\Big],$$

$$N_i(r) := \{\tilde{x}_i,\ \text{s.t.},\ r \le \|\tilde{x}_i - x_i\|_\Sigma \le \gamma \cdot r\}, \tag{2}$$

and $\lambda > 0$, $\mu > 0$ are regularization parameters. Instead of Euclidean distance, we use the Mahalanobis distance function $\|\tilde{x} - x\|_\Sigma^2 = \sum_j \sigma_j (\tilde{x}_j - x_j)^2$ where $x_j$, $\tilde{x}_j$ are the $j$-th coordinates of $x$ and $\tilde{x}$, respectively. $\sigma_j := \left|\frac{\partial f_\theta(x)}{\partial x_j}\right|$, i.e., $\sigma_j$ measures the “influence” of the $j$-th coordinate on the output, and is updated every epoch during training.

Similar to (1), we can use the standard projected gradient descent-ascent algorithm to optimize the above given saddle point problem. Here again, projection onto $N_i(r)$ is the key step. That is, the goal is to find: $\tilde{x}_i = \arg\min_x \|x - z\|^2$ s.t. $x \in N_i(r)$. Unlike in Section 3 and Algorithm 1, the above projection is unlikely to be available in closed form and requires more careful arguments.

Proposition 1. Consider the problem: $\min_{\tilde{x}} \|\tilde{x} - z\|^2$, s.t., $r^2 \le \|\tilde{x} - x\|_\Sigma^2 \le \gamma^2 r^2$ and let $\delta = z - x$. If $r \le \|\delta\|_\Sigma \le \gamma r$, then $\tilde{x} = z$. Otherwise, the optimal solution is: $\tilde{x} = x + (I + \tau\Sigma)^{-1}\delta$, where:

1) If $\|\delta\|_\Sigma \le r$,
$$\tau := \arg\min_{\tau \le 0} \sum_j \frac{\delta_j^2 \tau^2 \sigma_j^2}{(1 + \tau\sigma_j)^2}, \quad \text{s.t.,} \quad \sum_j \frac{\delta_j^2 \sigma_j}{(1 + \tau\sigma_j)^2} \ge r^2,$$

2) If $\|\delta\|_\Sigma \ge \gamma \cdot r$,
$$\tau^{-1} := \arg\min_{\nu \ge 0} \sum_j \frac{\delta_j^2 \sigma_j^2}{(\nu + \sigma_j)^2}, \quad \text{s.t.,} \quad \sum_j \frac{\delta_j^2 \sigma_j \nu^2}{(\nu + \sigma_j)^2} \le \gamma^2 r^2.$$

See Appendix A for a detailed proof. The above proposition reduces the projection problem to a non-convex but one-dimensional optimization problem. We solve this problem via standard grid search over: $\tau \in [-\frac{1}{\max_j \sigma_j}, 0]$ or $\nu \in [0, \frac{\alpha}{1 - \alpha} \max_j \sigma_j]$ where $\alpha = \gamma \cdot r/\|\delta\|_\Sigma$. The algorithm is now almost the same as Algorithm 1 but uses the above mentioned projection algorithm; see Appendix A for pseudo-code of our DROCC–LF method.

4.1. OCLN Evaluation Setup

Due to lack of benchmarks, it is difficult to evaluate a solution for OCLN. So, we provide a novel experimental setup for a wake-word detection and a digit classification problem, showing that DROCC–LF indeed significantly outperforms standard anomaly detection, binary classification, and DROCC–OE on practically relevant metrics (Section 5.2).

In particular, our setup is inspired by standard settings encountered by real-world detection problems. For example,

Figure 1. (a) A normal data manifold with red dots representing generated anomalous points in $N_i(r)$. (b) Decision boundary learned by DROCC when applied to the data from (a). Blue represents points classified as normal and red points are classified as abnormal. (c), (d): first two dimensions of the decision boundary of DROCC and DROCC–LF, when applied to noisy data (Section 5.2). DROCC–LF is nearly optimal while DROCC’s decision boundary is inaccurate. Yellow color sine wave depicts the train data.
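The one-dimensional grid search that Proposition 1 suggests for the DROCC–LF projection can be sketched as follows. This is an illustrative reading of the proposition, not the pseudo-code from Appendix A: the weights $\sigma_j$ are passed in as a fixed list (in the method they come from gradient magnitudes and are refreshed every epoch), at least one $\sigma_j$ is assumed strictly positive, and the names `sigma_norm`, `project_lf`, and the `grid` resolution are hypothetical.

```python
import math


def sigma_norm(delta, sigma):
    """Mahalanobis-style norm ||delta||_Sigma = sqrt(sum_j sigma_j * delta_j^2)."""
    return math.sqrt(sum(s * d * d for s, d in zip(sigma, delta)))


def project_lf(x, z, sigma, r, gamma, grid=1000):
    """Grid-search projection of z onto {u : r <= ||u - x||_Sigma <= gamma * r}.

    Assumes sigma_j >= 0 with at least one strictly positive entry.
    """
    delta = [zi - xi for zi, xi in zip(z, x)]
    dist = sigma_norm(delta, sigma)
    if r <= dist <= gamma * r:
        return list(z)                       # z is already feasible
    s_max = max(sigma)
    best = None                              # (objective, displacement from x)
    for k in range(1, grid + 1):
        if dist < r:
            # Case 1: tau in (-1 / max_j sigma_j, 0]; the left endpoint is
            # skipped because 1 + tau * sigma_j vanishes there.
            tau = -(1.0 / s_max) * (1.0 - k / grid)
            step = [d / (1.0 + tau * s) for d, s in zip(delta, sigma)]
            feasible = sigma_norm(step, sigma) >= r
        else:
            # Case 2: nu = 1 / tau in (0, alpha / (1 - alpha) * max_j sigma_j].
            alpha = gamma * r / dist
            nu = alpha / (1.0 - alpha) * s_max * k / grid
            step = [d * nu / (nu + s) for d, s in zip(delta, sigma)]
            feasible = sigma_norm(step, sigma) <= gamma * r
        if feasible:
            obj = sum((u - d) ** 2 for u, d in zip(step, delta))   # ||x_tilde - z||^2
            if best is None or obj < best[0]:
                best = (obj, step)
    if best is None:
        raise ValueError("no feasible grid point; try a finer grid")
    return [xi + u for xi, u in zip(x, best[1])]


x0, sig = [0.0, 0.0, 0.0], [1.0, 0.5, 0.1]
inner = project_lf(x0, [0.1, 0.1, 0.1], sig, r=1.0, gamma=2.0)   # pushed outward
outer = project_lf(x0, [5.0, 0.0, 0.0], sig, r=1.0, gamma=2.0)   # pulled inward
print(sigma_norm(inner, sig), sigma_norm(outer, sig))
```

Because feasibility is checked on a finite grid, the returned point satisfies the constraint but only approximates the exact minimizer; a finer `grid` tightens the approximation at linear cost.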

consider the wakeword detection problem, where the goal is to detect a wakeword like, say, “Marvin” in a continuous stream of data. In this setting, we are provided a few positive examples for Marvin and a few generic negative examples from everyday speech. But, in our experiment setup, we generate close or difficult negatives by generating examples like Arvin, Marvelous, etc. Now, in most real-world deployments, a critical requirement is low False Positive Rates, even on such difficult negatives. So, we study various methods with FPR bounded by, say, 3% or 5% on negative data that comprises generic negatives as well as difficult close negatives. Now, under the FPR constraint, we evaluate various methods by their recall rate, i.e., based on how many true positives the method is able to identify. We propose a similar setup for a digit classification problem as well; see Section 5.2 for more details.

5. Empirical Evaluation

In this section, we present empirical evaluation of DROCC on two one-class classification problems: Anomaly Detection and One-Class Classification with Limited Negatives (OCLN). We discuss the experimental setup, datasets, baselines, and the implementation details. Through experimental results on a wide range of synthetic and real-world datasets, we present strong empirical evidence for the effectiveness of our approach for one-class classification problems.

5.1. Anomaly Detection

Datasets: In all the experiments with multi-class datasets, we follow the standard one-vs-all setting for anomaly detection: fixing each class once as nominal and treating the rest as anomalies. The model is trained only on the nominal class but the test data is sampled from all the classes. For timeseries datasets, N represents the number of time-steps/frames and d represents the input feature length.

We perform experiments on the following datasets:

• 2-D sine-wave: 1000 points sampled uniformly from a 2-dimensional sine wave (see Figure 1a).
• Abalone (Dua & Graff, 2017): Physical measurements of abalone are provided and the task is to predict the age. Classes 3 and 21 are anomalies and classes 8, 9, and 10 are normal (Das et al., 2018).
• Arrhythmia (Rayana, 2016): Features derived from ECG and the task is to identify arrhythmic samples. Dimensionality is 279 but five categorical attributes are discarded. Dataset preparation is similar to Zong et al. (2018).
• Thyroid (Rayana, 2016): Determine whether a patient referred to the clinic is hypothyroid based on the patient’s medical data. Only 6 continuous attributes are considered. Dataset preparation is the same as Zong et al. (2018).
• Epileptic Seizure Recognition (Andrzejak et al., 2001): EEG based time-series dataset from multiple patients. Task is to identify if EEG is abnormal (N = 178, d = 1).
• Audio Commands (Warden, 2018): A multiclass dataset with 35 classes of audio keywords. Data is featurized using MFCC features with 32 filter banks over 25ms length windows with stride of 10ms (N = 98, d = 32). Dataset preparation is the same as Kusupati et al. (2018).
• CIFAR-10 (Krizhevsky, 2009): Widely used benchmark for anomaly detection, 10 classes with 32 × 32 images.
• ImageNet-10: a subset of 10 randomly chosen classes from the ImageNet dataset (Deng et al., 2009) which contains 224 × 224 color images.

The datasets which we use are all publicly available. We use the train-test splits when already available, with an 80-20 split for train and validation set. In all other cases, we use a random 60-20-20 split for train, validation, and test.

DROCC Implementation: The main hyper-parameter of our algorithm is the radius $r$ which defines the set $N_i(r)$. We observe that tweaking the radius value around $\sqrt{d}/2$ (where $d$ is the dimension of the input data) works the best, as due to zero-mean, unit-variance normalized features, the average distance between random points is $\approx \sqrt{d}$. We fix $\gamma$ as 2 in our experiments unless specified otherwise. Parameter $\mu$ in (1) is chosen from {0.5, 1.0}. We use a standard step size from {0.1, 0.01} for gradient ascent and from {$10^{-2}$, $10^{-4}$} for gradient descent; we also tune the optimizer ∈ {Adam, SGD}. See Appendix D for a detailed ablation study. The implementation is available as part of the EdgeML package (Dennis et al.). The experiments were run on an Intel Xeon CPU with 12 cores clocked at 2.60 GHz and with NVIDIA Tesla P40 GPU, CUDA 10.2, and cuDNN 7.6.

5.1.1. Results

Synthetic Data: We present results on a simple 2-D sine wave dataset to visualize the kind of classifiers learnt by DROCC. Here, the positive data lies on a 1-D manifold given in Figure 1a. We observe from Figure 1b that DROCC is able to capture the manifold accurately, whereas the classical methods OC-SVM and DeepSVDD (shown in Appendix B) perform poorly as they both try to learn a minimum enclosing ball for the whole set of positive data points.

Tabular Data: Table 2 compares DROCC against various classical algorithms: OC-SVM, LOF (Breunig et al., 2000) as well as deep baselines: DCN (Caron et al., 2018), Autoencoder (Zong et al., 2018), DAGMM (Zong et al., 2018), DeepSVDD and GOAD (Bergman & Hoshen, 2020) on the widely used benchmark datasets, Arrhythmia, Thyroid and Abalone. In line with prior work, we use the F1-Score for comparing the methods (Bergman & Hoshen, 2020; Zong et al., 2018). A fully-connected network with a single hidden layer is used as the base network for all the deep

Table 1. Average AUC (with standard deviation) for one-vs-all anomaly detection on CIFAR-10. DROCC outperforms baselines on most classes, with gains as high as 20%, and notably, nearest neighbours beats all the baselines on 2 classes.
CIFAR Class OC-SVM    IF        DCAE      AnoGAN    ConAD 16  Soft-Bound Deep SVDD  One-Class Deep SVDD  Nearest Neighbour  DROCC (Ours)
Airplane    61.6±0.9  60.1±0.7  59.1±5.1  67.1±2.5  77.2      61.7±4.2              61.7±4.1             69.02              81.66 ± 0.22
Automobile  63.8±0.6  50.8±0.6  57.4±2.9  54.7±3.4  63.1      64.8±1.4              65.9±2.1             44.2               76.738 ± 0.99
Bird        50.0±0.5  49.2±0.4  48.9±2.4  52.9±3.0  63.1      49.5±1.4              50.8±0.8             68.27              66.664 ± 0.96
Cat         55.9±1.3  55.1±0.4  58.4±1.2  54.5±1.9  61.5      56.0±1.1              59.1±1.4             51.32              67.132 ± 1.51
Deer        66.0±0.7  49.8±0.4  54.0±1.3  65.1±3.2  63.3      59.1±1.1              60.9±1.1             76.71              73.624 ± 2.00
Dog         62.4±0.8  58.5±0.4  62.2±1.8  60.3±2.6  58.8      62.1±2.4              65.7±2.5             49.97              74.434 ± 1.95
Frog        74.7±0.3  42.9±0.6  51.2±5.2  58.5±1.4  69.1      67.8±2.4              67.7±2.6             72.44              74.426 ± 0.92
Horse       62.6±0.6  55.1±0.7  58.6±2.9  62.5±0.8  64.0      65.2±1.0              67.3±0.9             51.13              71.39 ± 0.22
Ship        74.9±0.4  74.2±0.6  76.8±1.4  75.8±4.1  75.5      75.6±1.7              75.9±1.2             69.09              80.016 ± 1.69
Truck       75.9±0.3  58.9±0.7  67.3±3.0  66.5±2.8  63.7      71.0±1.1              73.1±1.2             43.33              76.21 ± 0.67
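The one-vs-all AUC protocol behind Table 1 can be illustrated with a small sketch. This is not the authors' evaluation code: the helper names `auc_score` and `one_vs_all_auc`, and the convention that a higher score means "more anomalous", are assumptions made for illustration. AUC is computed here directly as the probability that a randomly chosen anomaly is scored above a randomly chosen nominal point, with ties counted as one half.

```python
def auc_score(scores_normal, scores_anomaly):
    """AUC as P(anomaly score > normal score); ties contribute 1/2."""
    wins = 0.0
    for a in scores_anomaly:
        for n in scores_normal:
            if a > n:
                wins += 1.0
            elif a == n:
                wins += 0.5
    return wins / (len(scores_anomaly) * len(scores_normal))


def one_vs_all_auc(score_fn, test_points, test_labels, nominal_class):
    """Treat `nominal_class` as normal and every other class as anomalous.

    `score_fn` is a hypothetical anomaly scorer (higher = more anomalous).
    """
    normal = [score_fn(x) for x, y in zip(test_points, test_labels)
              if y == nominal_class]
    anomaly = [score_fn(x) for x, y in zip(test_points, test_labels)
               if y != nominal_class]
    return auc_score(normal, anomaly)
```

Repeating `one_vs_all_auc` with each class in turn as the nominal class gives one AUC per class, which is the quantity reported per row in the table. The quadratic pairwise loop is chosen for clarity; a rank-based computation would be preferable for large test sets.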

Table 2. F1-Score (with standard deviation) for one-vs-all anomaly detection on Thyroid, Arrhythmia, and Abalone datasets. DROCC outperforms the baselines on all the three datasets.
Method                            Thyroid      Arrhythmia   Abalone
OC-SVM (Schölkopf et al., 1999)   0.39 ± 0.01  0.46 ± 0.00  0.48 ± 0.00
DCN (Caron et al., 2018)          0.33 ± 0.03  0.38 ± 0.03  0.40 ± 0.01
E2E-AE (Zong et al., 2018)        0.13 ± 0.04  0.45 ± 0.03  0.33 ± 0.03
LOF (Breunig et al., 2000)        0.54 ± 0.01  0.51 ± 0.01  0.33 ± 0.01
DAGMM (Zong et al., 2018)         0.49 ± 0.04  0.49 ± 0.03  0.20 ± 0.03
DeepSVDD (Ruff et al., 2018)      0.73 ± 0.00  0.54 ± 0.01  0.62 ± 0.01
GOAD (Bergman & Hoshen, 2020)     0.72 ± 0.01  0.51 ± 0.02  0.61 ± 0.02
DROCC (Ours)                      0.78 ± 0.03  0.69 ± 0.02  0.68 ± 0.02

Table 3. AUC (with standard deviation) for one-vs-all anomaly detection on Epileptic Seizures and Audio Keyword “Marvin”. DROCC outperforms the baselines on both the datasets.
Method                          Epileptic Seizure  Audio Keywords
kNN                             91.75              65.81
AE (Sakurada & Yairi, 2014)     91.15 ± 1.7        51.49 ± 1.9
REBM (Zhai et al., 2016)        97.24 ± 2.1        63.73 ± 2.4
DeepSVDD (Ruff et al., 2018)    94.84 ± 1.7        68.38 ± 1.8
DROCC (Ours)                    98.23 ± 0.7        70.21 ± 1.1

baselines. We observe significant gains across all the three datasets for DROCC, as high as 18% in Arrhythmia.
Time-Series Data: There is a lack of work on anomaly detection for time-series datasets. Hence we extensively evaluate our method DROCC against deep baselines like AutoEncoders (Sakurada & Yairi, 2014), REBM (Zhai et al., 2016) and DeepSVDD. For autoencoders, we use the architecture presented in Srivastava et al. (2015). A single layer LSTM is used for all the deep baselines. Motivated by recent analysis (Gu et al., 2019), we also include nearest neighbours as a baseline. Table 3 compares the performance of DROCC against these baselines on the univariate Epileptic Seizure dataset, and the Audio Commands dataset. DROCC outperforms the baselines on both the datasets.
Image Data: For experiments on image datasets, we fixed γ as 1. Table 1 compares DROCC on CIFAR-10 against baseline numbers from OC-SVM (Schölkopf et al., 1999), IF (Liu et al., 2008), DCAE (Seeböck et al., 2016), AnoGAN (Schlegl et al., 2017b), DeepSVDD as reported by Ruff et al. (2018) and against ConAD 16 as reported by Nguyen et al. (2019). Again, we include nearest neighbours as one of the baselines. The LeNet (LeCun et al., 1998) architecture was used for all the baselines and DROCC for this experiment. DROCC consistently achieves the best performance on most classes, with gains as high as 20% over DeepSVDD on some classes. An interesting observation is that for the classes Bird and Deer, Nearest Neighbour achieves competitive performance, beating all the other baselines.
As discussed in Section 2, Golan & El-Yaniv (2018) and Hendrycks et al. (2019b) use domain-specific transformations like flips and rotations to perform the AD task. The performance of these approaches is heavily dependent on the interaction between the transformations and the dataset. They would suffer significantly in more realistic settings where the images of the normal class itself have been captured from multiple orientations. For example, even in CIFAR, for the airplane class, the accuracy is relatively low (DROCC is 7% more accurate) as the images have airplanes in multiple angles. In fact, we try to mimic a more realistic scenario by augmenting the CIFAR-10 data with flips and small rotations of angle ±30°. Table 4 depicts the drop in performance of GEOM when augmentations are added in the CIFAR-10 dataset. For example, on the deer class of the CIFAR-10 dataset, GEOM has an AUC of 87.8%, which falls to 65.8% when augmented CIFAR-10 is used, whereas DROCC's performance remains nearly the same (∼72%).
Next, we benchmark the performance of DROCC on high resolution images that require the use of large modern neural architectures. Table 5 presents the results of our experiments on ImageNet. DROCC continues to achieve the best results amongst all the compared methods. Autoencoder fails drastically on this dataset, so we exclude comparisons. For DeepSVDD and DROCC, the MobileNetv2 (Sandler et al., 2018b) architecture is used. We observe that for all classes, except golf ball, DROCC outperforms the baselines. For

Table 4. Comparing DROCC against GEOM (Golan & El-Yaniv, 2018) on CIFAR-10 data flipped and rotated by a small angle of ±30 degrees.
CIFAR-10 Class  GEOM (No Aug)  DROCC (No Aug)  GEOM (with Aug)  DROCC (with Aug)
Airplane        74.7 ± 0.4     81.6 ± 0.2      62.4 ± 1.7       77.2 ± 1.2
Automobile      95.7 ± 0.0     76.7 ± 1.0      71.8 ± 1.2       74.5 ± 1.8
Bird            78.1 ± 0.4     66.7 ± 1.0      50.6 ± 0.5       67.5 ± 1.0
Cat             72.4 ± 0.5     67.1 ± 1.5      52.5 ± 0.7       68.8 ± 2.3
Deer            87.8 ± 0.2     73.6 ± 2.0      65.7 ± 1.7       71.1 ± 2.9
Dog             87.8 ± 0.1     74.4 ± 1.9      69.6 ± 1.3       71.3 ± 0.4
Frog            83.4 ± 0.5     74.4 ± 0.9      68.3 ± 1.1       71.2 ± 1.6
Horse           95.5 ± 0.1     71.4 ± 0.2      84.8 ± 0.8       63.5 ± 3.5
Ship            93.3 ± 0.0     80.0 ± 1.7      79.6 ± 2.2       76.4 ± 3.5
Truck           91.3 ± 0.1     76.2 ± 0.7      85.7 ± 0.5       74.0 ± 1.0

Table 5. Average AUC (with standard deviation) for one-vs-all anomaly detection on ImageNet. DROCC consistently achieves the best performance for all but one class.
ImageNet Class    Nearest Neighbor  DeepSVDD       DROCC (Ours)
Tench             65.57             65.14 ± 1.03   70.19 ± 1.7
English Springer  56.37             66.45 ± 3.16   70.45 ± 4.99
Cassette Player   47.7              60.47 ± 5.35   71.17 ± 1
Chainsaw          45.22             59.43 ± 4.13   68.63 ± 1.86
Church            61.35             56.31 ± 4.23   67.46 ± 4.17
French Horn       50.52             53.06 ± 6.52   76.97 ± 1.67
Garbage Truck     54.2              62.15 ± 4.39   69.06 ± 2.34
Gas Pump          47.43             56.66 ± 1.49   69.94 ± 0.57
Golf Ball         70.36             72.23 ± 3.43   70.72 ± 3.83
Parachute         75.87             81.35 ± 3.73   93.5 ± 1.41

instance, on French-Horn vs. rest problem, DROCC is 23% more accurate than DeepSVDD.

5.2. One-class Classification with Limited Negatives (OCLN)
Recall that the goal in OCLN is to learn a classifier that is accurate for both the in-sample positive (or normal) class points and the arbitrary out-of-distribution (OOD) negatives. Naturally, the key metric for this problem is False Positive Rate (FPR). In our experiments, we bound any method to have FPR smaller than a threshold, and under that constraint, we measure its recall value, i.e., the fraction of true positives that are correctly predicted.
We compare DROCC–LF against the following baselines: a) Standard binary classifier: that is, we ignore the challenge of OOD negatives and treat the problem as a standard classification task, b) DeepSAD (Ruff et al., 2020): a semi-supervised anomaly detection method, but it is not explicitly designed to handle negatives that are very close to positives (OOD negatives) and c) DROCC–OE: Outlier exposure type extension where DROCC's anomaly detection loss (based on using Euclidean distance as a local distance measure over the manifold) is combined with standard cross-entropy loss over the given limited negative data. Similar to the anomaly detection experiments, we use the same underlying network architecture across all the baselines.

Figure 2. Sample positives, negatives and close negatives for MNIST digit 0 vs 1 experiment (OCLN).

5.2.1. Results
Synthetic Data: We sample 1024 points in $\mathbb{R}^{10}$, where the first two coordinates are sampled from the 2D-sine wave, as in the previous section. Coordinates 3 to 10 are sampled from the spherical Gaussian distribution. Note that due to the 8 noisy dimensions, DROCC would be forced to set $r = \sqrt{d}$ where $d = 10$, while the true low-dimensional manifold is restricted to only two dimensions. Consequently, it learns an inaccurate boundary as shown in Figure 1c, similar to the boundary learned by OC-SVM and DeepSVDD; points that are predicted to be positive by DROCC are colored blue. In contrast, DROCC–LF is able to learn that only the first two coordinates are useful for the distinction between positives and negatives, and hence is able to learn a skewed distance function, leading to an accurate decision boundary (see Figure 1d).
MNIST 0 vs. 1 Classification: We consider an experimental setup on the MNIST dataset, where the training data consists of Digit 0, the normal class, and Digit 1 as the anomaly. During evaluation, in addition to samples from the training distribution, we also have half zeros, which act as challenging OOD points (close negatives). These half zeros are generated by randomly masking 50% of the pixels (Figure 2). BCE performs poorly, with a recall of only 54% at a fixed FPR of 3%. DROCC–OE gives a recall of 98.16%, outperforming DeepSAD (recall of 90.91%) by a margin of 7%. DROCC–LF provides further improvement with a recall of 99.4% at 3% FPR.
Wakeword Detection: Finally, we evaluate DROCC–LF on the practical problem of wakeword detection with low FPR against arbitrary OOD negatives. To this end, we identify a keyword, say “Marvin” from the audio commands dataset (Warden, 2018) as the positive class, and the remaining 34 keywords are labeled as the negative class. For training, we sample points uniformly at random from the above

mentioned dataset. However, for evaluation, we sample positives from the train distribution, but negatives contain a few challenging OOD points as well. Sampling challenging negatives itself is a hard task and is the key motivating reason for studying the problem. So, we manually list close-by keywords to Marvin such as: Mar, Vin, Marvelous etc. We then generate audio snippets for these keywords via a speech synthesis tool (see footnote 2) with a variety of accents.
Figure 3 shows that for 3% and 5% FPR settings, DROCC–LF is significantly more accurate than the baselines. For example, with FPR=3%, DROCC–LF is 10% more accurate than the baselines. We repeated the same experiment with the keyword Seven, and observed a similar trend. See Table 9 in the Appendix for the list of the close negatives which were synthesized for each of the keywords.

Figure 3. OCLN on Audio Commands: Comparison of Recall for keywords Marvin and Seven when the False Positive Rate (FPR) is fixed to be 3% and 5%. DROCC–LF is consistently about 10% more accurate than all the baselines.

In summary, DROCC–LF is able to generalize well against negatives that are “close” to the true positives even when such negatives were not supplied with the training data.

Footnote 2: [Link] services/cognitive-services/text-to-speech/

6. Conclusions
We introduced the DROCC method for deep anomaly detection. It models normal data points using a low-dimensional manifold, and hence can compare close points via Euclidean distance. Based on this intuition, DROCC's optimization is formulated as a saddle point problem which is solved via the standard gradient descent-ascent algorithm. We then extended DROCC to the OCLN problem, where the goal is to generalize well against arbitrary negatives, assuming the positive class is well sampled and a small number of negative points are also available. Both the methods perform significantly better than strong baselines in their respective problem settings. For computational efficiency, we simplified the projection set for both the methods, which can perhaps slow down the convergence of the two methods. Designing optimization algorithms that can work with the stricter set is an exciting research direction. Further, we would also like to rigorously analyse DROCC, assuming enough samples from a low-curvature manifold. Finally, as OCLN is an exciting problem that routinely comes up in a variety of real-world applications, we would like to apply DROCC–LF to a few high impact scenarios.

Acknowledgments
We are grateful to Aditya Kusupati, Nagarajan Natarajan, Sahil Bhatia and Oindrila Saha for helpful discussions and feedback. AR was funded by an Open Philanthropy AI Fellowship and Google PhD Fellowship in Machine Learning.

References
Aggarwal, C. C. Outlier Analysis. Springer Publishing Company, Incorporated, 2nd edition, 2016. ISBN 3319475770.
Andrzejak, R. G., Lehnertz, K., Mormann, F., Rieke, C., David, P., and Elger, C. E. Indications of nonlinear deterministic and finite-dimensional structures in time series of brain electrical activity: Dependence on recording region and brain state. Physical Review E, 64(6), 2001.
Bergman, L. and Hoshen, Y. Classification-based anomaly detection for general data. In International Conference on Learning Representations (ICLR), 2020.
Boyd, S. and Vandenberghe, L. Convex Optimization. Cambridge University Press, USA, 2004. ISBN 0521833787.
Breunig, M. M., Kriegel, H.-P., Ng, R. T., and Sander, J. Lof: identifying density-based local outliers. In Proceedings of the 2000 ACM SIGMOD international conference on Management of data, pp. 93–104, 2000.
Caron, M., Bojanowski, P., Joulin, A., and Douze, M. Deep clustering for unsupervised learning of visual features. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 132–149, 2018.
Chandola, V., Banerjee, A., and Kumar, V. Anomaly detection: A survey. ACM Computing Surveys (CSUR), 41(3), 2009.
Das, S., Islam, M. R., Jayakodi, N. K., and Doppa, J. R. Active anomaly detection via ensembles, 2018.
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.
Dennis, D. K., Gaurkar, Y., Gopinath, S., Goyal, S., Gupta, C., Jain, M., Kumar, A., Kusupati, A., Lovett, C., Patil, S. G., Saha, O., and Simhadri, H. V. EdgeML: Machine
Learning for resource-constrained edge devices. URL [Link]
Dua, D. and Graff, C. UCI machine learning repository, 2017. URL [Link]
Golan, I. and El-Yaniv, R. Deep anomaly detection using geometric transformations. In Advances in Neural Information Processing Systems (NeurIPS), 2018.
Goldstein, M. and Uchida, S. A comparative evaluation of unsupervised anomaly detection algorithms for multivariate data. PLOS ONE, 11(4), 2016.
Gu, X., Akoglu, L., and Rinaldo, A. Statistical analysis of nearest neighbor methods for anomaly detection. In Advances in Neural Information Processing Systems (NeurIPS), 2019.
Hendrycks, D., Mazeika, M., and Dietterich, T. Deep anomaly detection with outlier exposure. In International Conference on Learning Representations (ICLR), 2019a.
Hendrycks, D., Mazeika, M., Kadavath, S., and Song, D. Using self-supervised learning can improve model robustness and uncertainty. In Advances in Neural Information Processing Systems (NeurIPS), 2019b.
Krizhevsky, A. Learning multiple layers of features from tiny images, 2009.
Kusupati, A., Singh, M., Bhatia, K., Kumar, A., Jain, P., and Varma, M. Fastgrnn: A fast, accurate, stable and tiny kilobyte sized gated recurrent neural network. In Advances in Neural Information Processing Systems, pp. 9017–9028, 2018.
Lakhina, A., Crovella, M., and Diot, C. Diagnosing network-wide traffic anomalies. SIGCOMM Comput. Commun. Rev., 34(4), 2004.
LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 1998.
Li, D., Chen, D., Goh, J., and kiong Ng, S. Anomaly detection with generative adversarial networks for multivariate time series, 2018.
Liu, F. T., Ting, K. M., and Zhou, Z.-H. Isolation forest. In Proceedings of the 2008 Eighth IEEE International Conference on Data Mining, 2008.
Madry, A., Makelov, A., Schmidt, L., Tsipras, D., and Vladu, A. Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations (ICLR), 2018.
Malhotra, P., Ramakrishnan, A., Anand, G., Vig, L., Agarwal, P., and Shroff, G. Lstm-based encoder-decoder for multi-sensor anomaly detection, 2016. URL https://[Link]/abs/1607.00148.
Nguyen, D. T., Lou, Z., Klar, M., and Brox, T. Anomaly detection with multiple-hypotheses predictions. In International Conference on Machine Learning (ICML), 2019.
Pless, R. and Souvenir, R. A survey of manifold learning for images. IPSJ Transactions on Computer Vision and Applications, 1, 2009.
Rayana, S. ODDS library, 2016. URL [Link] [Link].
Ruff, L., Vandermeulen, R., Goernitz, N., Deecke, L., Siddiqui, S. A., Binder, A., Müller, E., and Kloft, M. Deep one-class classification. In International Conference on Machine Learning (ICML), 2018.
Ruff, L., Vandermeulen, R. A., Görnitz, N., Binder, A., Müller, E., Müller, K.-R., and Kloft, M. Deep semi-supervised anomaly detection. In International Conference on Learning Representations (ICLR), 2020.
Sakurada, M. and Yairi, T. Anomaly detection using autoencoders with nonlinear dimensionality reduction. In Proceedings of the MLSDA 2014 2nd Workshop on Machine Learning for Sensory Data Analysis, 2014.
Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., and Chen, L.-C. Mobilenetv2: Inverted residuals and linear bottlenecks, 2018a. URL [Link] abs/1801.04381.
Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., and Chen, L.-C. Mobilenetv2: Inverted residuals and linear bottlenecks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018b.
Schlegl, T., Seeböck, P., Waldstein, S. M., Schmidt-Erfurth, U., and Langs, G. Unsupervised anomaly detection with generative adversarial networks to guide marker discovery. In International conference on information processing in medical imaging, pp. 146–157. Springer, 2017a.
Schlegl, T., Seeböck, P., Waldstein, S. M., Schmidt-Erfurth, U., and Langs, G. Unsupervised anomaly detection with generative adversarial networks to guide marker discovery. Information Processing in Medical Imaging, 2017b.
Schölkopf, B., Williamson, R., Smola, A., Shawe-Taylor, J., and Platt, J. Support vector method for novelty detection. In Proceedings of the 12th International Conference on Neural Information Processing Systems, 1999.

Seeböck, P., Waldstein, S., Klimscha, S., Gerendas, B. S., Donner, R., Schlegl, T., Schmidt-Erfurth, U., and Langs, G. Identifying and categorizing anomalies in retinal imaging data. arXiv preprint arXiv:1612.00686, 2016.
Srivastava, N., Mansimov, E., and Salakhudinov, R. Unsupervised learning of video representations using lstms. In International Conference on Machine Learning (ICML), 2015.
Tax, D. M. and Duin, R. P. Support vector data description. Machine Learning, 54(1), 2004.
Warden, P. Speech commands: A dataset for limited-vocabulary speech recognition, 2018. URL https://[Link]/abs/1804.03209.
Zhai, S., Cheng, Y., Lu, W., and Zhang, Z. Deep structured energy based models for anomaly detection. In International Conference on Machine Learning (ICML), 2016.
Zong, B., Song, Q., Min, M. R., Cheng, W., Lumezanu, C., Cho, D., and Chen, H. Deep autoencoding gaussian mixture model for unsupervised anomaly detection. In International Conference on Learning Representations, 2018. URL [Link] id=BJJLHbb0-.

A. OCLN
A.1. DROCC–LF Proof
Proof of Proposition 1. Recall the problem:

$$\min_{\tilde{x}} \|\tilde{x} - z\|^2, \quad \text{s.t.} \quad r^2 \le \|\tilde{x} - x\|_\Sigma^2 \le \gamma^2 r^2.$$

Note that both the constraints cannot be active at the same time, so we can consider either the constraint $r^2 \le \|\tilde{x} - x\|_\Sigma^2$ or the constraint $\|\tilde{x} - x\|_\Sigma^2 \le \gamma^2 r^2$. Below, we give the calculation when the former constraint is active; the proof for the latter follows along the same lines.

Let $\tau \le 0$ be the Lagrangian multiplier, then the Lagrangian function of the above problem is given by:

$$L(\tilde{x}, \tau) = \|\tilde{x} - z\|^2 + \tau \left( \|\tilde{x} - x\|_\Sigma^2 - r^2 \right).$$

Using the KKT first-order necessary condition (Boyd & Vandenberghe, 2004), the following should hold for any optimal solution $\tilde{x}, \tau$:

$$\nabla_{\tilde{x}} L(\tilde{x}, \tau) = 0.$$

That is,

$$\tilde{x} = (I + \tau \Sigma)^{-1} (z + \tau \cdot \Sigma x) = x + (I + \tau \cdot \Sigma)^{-1} \delta,$$

where $\delta = z - x$. This proves the first part of the lemma.

Now, by using primal and dual feasibility required by the KKT conditions, we have:

$$\min_{\tau \le 0} \|\tilde{x} - z\|^2, \quad \text{s.t.} \quad \|\tilde{x} - x\|_\Sigma^2 \ge r^2,$$

where $\tilde{x} = (I + \tau \Sigma)^{-1} (z + \tau \cdot \Sigma x) = x + (I + \tau \cdot \Sigma)^{-1} \delta$. The lemma now follows by substituting $\tilde{x}$ above and by using the fact that $\Sigma$ is a diagonal matrix with $\Sigma(i, i) = \sigma_i$.

A.2. DROCC–LF Algorithm
See Algorithm Box 2.

Algorithm 2 Training neural networks via DROCC–LF
Input: Training data D = [(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)].
Parameters: Radius r, λ ≥ 0, µ ≥ 0, step-size η, number of gradient steps m, number of initial training steps n_0.
Initial steps: For B = 1, ..., n_0
    X_B: Batch of training inputs
    θ = θ − Gradient-Step( Σ_{(x,y) ∈ X_B} ℓ(f_θ(x), y) )
DROCC steps: For B = n_0, ..., n_0 + N
    X_B: Batch of normal training inputs (y = 1)
    ∀x ∈ X_B: h ∼ N(0, I_d)
    Adversarial search: For i = 1, ..., m
        1. ℓ(h) = ℓ(f_θ(x + h), −1)
        2. h = h + η ∇_h ℓ(h) / ‖∇_h ℓ(h)‖
        3. h = Projection given by Proposition 1 (δ = h)
    ℓ_itr = λ‖θ‖² + Σ_{(x,y) ∈ X_B} [ ℓ(f_θ(x), y) + µ ℓ(f_θ(x + h), −1) ]
    θ = θ − Gradient-Step(ℓ_itr)

Figure 4. (a) Spherical manifold (a unit sphere) that captures the normal data distribution. Points are uniformly sampled from the volume of the unit sphere. (b) OOD points (red) are sampled on the surface of a sphere of varying radius (panels show radius 1, 1.5, ...). Table 6 shows AUC values with varying radius.
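The closed form from the proof in A.1 can be checked numerically. The following is a minimal sketch, not the authors' implementation: it assumes a diagonal Σ passed as a list `sigma`, handles only the lower constraint (Mahalanobis distance at least r), and replaces the one-dimensional minimization over τ ≤ 0 with a coarse grid search on the interval where every 1 + τσ_i stays positive. The function name `project_lower` and the grid resolution are hypothetical choices.

```python
def project_lower(x, z, sigma, r, steps=10000):
    """Closest point to z whose Sigma-weighted distance from x is >= r.

    Uses x_tilde = x + (I + tau * Sigma)^{-1} (z - x) with diagonal Sigma
    and scans tau from 0 towards -1 / max(sigma). On this interval the
    objective ||x_tilde - z||^2 grows as tau decreases, so the first
    feasible tau on the grid is the best one on the grid.
    """
    delta = [zi - xi for zi, xi in zip(z, x)]
    tau_min = -1.0 / max(sigma)
    for k in range(steps):
        tau = tau_min * k / steps
        x_tilde = [xi + di / (1.0 + tau * si)
                   for xi, di, si in zip(x, delta, sigma)]
        dist_sq = sum(si * (ti - xi) ** 2
                      for si, ti, xi in zip(sigma, x_tilde, x))
        if dist_sq >= r * r:
            return x_tilde, tau
    return None, None  # no feasible point found on the grid
```

For a returned pair, the stationarity condition (x̃ − z) + τ Σ (x̃ − x) = 0 holds coordinate-wise, which mirrors the KKT condition used in the proof. If z already satisfies the constraint, the search returns z itself with τ = 0.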

B. Synthetic Experiments
B.1. 1-D Sine Manifold
In Section 5.1.1 we presented results on a synthetic dataset of 1024 points sampled from a 1-D sine wave (See Figure 1a). We compare DROCC to other anomaly detection methods by plotting the decision boundaries on this same dataset. Figure 5 shows the decision boundary for a) DROCC b) OC-SVM with RBF kernel c) OC-SVM with 20-degree polynomial kernel d) DeepSVDD. All methods are trained only on positive points from the 1-D manifold.
We further evaluate these methods for varied sampling of negative points near the positive manifold. Negative points are sampled from a 1-D sine manifold vertically displaced in both directions (See Figure 6). Table 7 compares DROCC against various baselines on this dataset.

B.2. Spherical Manifold
OC-SVM and DeepSVDD try to find a minimum enclosing ball for the whole set of positive points, while DROCC assumes that the true points lie on a low dimensional manifold. We now test these methods on a different synthetic dataset: a spherical manifold where the positive points are within a sphere, as shown in Figure 4a. Normal/Positive points are sampled uniformly from the volume of the unit sphere. Table 6 compares DROCC against various baselines when the OOD points are sampled on the surface of a sphere of varying radius (See Figure 4b). DROCC again outperforms all the baselines even in the case when a minimum enclosing ball would suit the best. Suppose instead of neural networks, we were operating with purely linear models, then DROCC also essentially finds the minimum enclosing ball (for a suitable radius r). If r is too small,

Table 6. Average AUC for Spherical manifold experiment (Section B.2). Normal points are sampled uniformly from the volume of a unit sphere and OOD points are sampled from the surface of a sphere of varying radius (See Figure 4b). Again DROCC outperforms all the baselines when the OOD points are quite close to the normal distribution.
Radius  Nearest Neighbor  OC-SVM        AutoEncoder   DeepSVDD       DROCC (Ours)
1.2     100 ± 0.00        92.00 ± 0.00  91.81 ± 2.12  93.26 ± 0.91   99.44 ± 0.10
1.4     100 ± 0.00        92.97 ± 0.00  97.85 ± 1.41  98.81 ± 0.34   99.99 ± 0.00
1.6     100 ± 0.00        92.97 ± 0.00  99.92 ± 0.11  99.99 ± 0.00   100.00 ± 0.00
1.8     100 ± 0.00        91.87 ± 0.00  99.98 ± 0.04  100.00 ± 0.00  100.00 ± 0.00
2.0     100 ± 0.00        91.83 ± 0.00  100 ± 0.00    100.00 ± 0.00  100.00 ± 0.00
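The data for the spherical experiment in Table 6 can be generated along the following lines. This is a sketch under stated assumptions: the source does not give the ambient dimension or the sampler, so `dim` is left as a parameter, and the helper names are hypothetical. Uniform sampling inside a ball uses the standard recipe of a uniformly random direction scaled by U^(1/dim).

```python
import math
import random


def sample_direction(dim, rng):
    """Uniformly random unit vector obtained by normalizing a Gaussian draw."""
    while True:
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(c * c for c in v))
        if norm > 0.0:
            return [c / norm for c in v]


def sample_in_ball(n, dim, rng):
    """Normal points: uniform in the volume of the unit ball."""
    points = []
    for _ in range(n):
        u = sample_direction(dim, rng)
        radius = rng.random() ** (1.0 / dim)  # makes the density uniform in volume
        points.append([radius * c for c in u])
    return points


def sample_on_sphere(n, dim, radius, rng):
    """OOD points: uniform on the surface of a sphere of the given radius."""
    return [[radius * c for c in sample_direction(dim, rng)] for _ in range(n)]
```

Calling `sample_on_sphere` with radii 1.2, 1.4, ..., 2.0 would produce OOD sets analogous to the rows of the table; the closer the radius is to 1, the harder the negatives are to separate from the normal points.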

Table 7. Average AUC for the synthetic 1-D Sine Wave manifold experiment (Section B.1). Normal points are sampled from a sine wave and OOD points from a vertically displaced manifold (See Figure 6). The results demonstrate that only DROCC is able to capture the manifold tightly.
Vertical Displacement  Nearest Neighbor  OC-SVM        AutoEncoder   DeepSVDD      DROCC (Ours)
0.2                    100 ± 0.00        56.99 ± 0.00  52.48 ± 1.15  65.91 ± 0.64  96.80 ± 0.65
0.4                    100 ± 0.00        68.84 ± 0.00  58.59 ± 0.61  78.18 ± 1.67  99.31 ± 0.80
0.6                    100 ± 0.00        76.95 ± 0.00  66.59 ± 1.21  82.85 ± 1.96  99.92 ± 0.11
0.8                    100 ± 0.00        81.73 ± 0.00  77.42 ± 3.62  86.26 ± 1.69  99.98 ± 0.01
1.0                    100 ± 0.00        88.18 ± 0.00  86.14 ± 2.52  90.51 ± 2.62  100 ± 0.00
2.0                    100 ± 0.00        98.56 ± 0.00  100 ± 0.00    100 ± 0.00    100 ± 0.00
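The displaced-sine negatives used for Table 7 can be sketched as follows. The x-range of the sine wave is not stated in the source, so the interval [0, 2π] used here is an assumption, as is the function name `sample_sine`. Positives use a displacement of zero; negatives are shifted up or down by the chosen displacement with equal probability.

```python
import math
import random


def sample_sine(n, rng, displacement=0.0, x_max=2.0 * math.pi):
    """Sample n points (x, sin(x) +/- displacement) with x uniform in [0, x_max]."""
    points = []
    for _ in range(n):
        x = rng.uniform(0.0, x_max)
        shift = displacement * rng.choice([-1.0, 1.0])  # displaced in both directions
        points.append((x, math.sin(x) + shift))
    return points
```

With this sketch, `sample_sine(1000, rng)` plays the role of the training manifold and `sample_sine(1000, rng, displacement=0.2)` the hardest row of the table.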

Figure 5. (a) Decision boundary of DROCC trained only on the positive points lying on the 1-D sine manifold in Figure 1a. Blue represents points classified as normal and red classified as abnormal. (b) Decision boundary of classical OC-SVM using RBF kernel and same experiment settings as in (a). Yellow sine wave just shows the underlying train data. (c) Decision boundary of classical OC-SVM using a 20-degree polynomial kernel. (d) Decision boundary of DeepSVDD.

Figure 6. Illustration of the negative points sampled at various displacements of the sine wave; used for reporting the AUC values in Table 7. In this figure, vertical displacement is 1.0. Blue represents the positive points (also the training data) and red represents the negative/OOD points.

Table 8. Ablation Study on CIFAR-10: Sampling negative points randomly in the set N_i(r) (DROCC–Rand) instead of gradient ascent (DROCC).
CIFAR Class  One-Class Deep SVDD  DROCC         DROCC–Rand
Airplane     61.7±4.1             81.66 ± 0.22  79.67 ± 2.09
Automobile   65.9±2.1             76.74 ± 0.99  73.48 ± 1.44
Bird         50.8±0.8             66.66 ± 0.96  62.76 ± 1.59
Cat          59.1±1.4             67.13 ± 1.51  67.33 ± 0.72
Deer         60.9±1.1             73.62 ± 2.00  56.09 ± 1.19
Dog          65.7±2.5             74.43 ± 1.95  65.88 ± 0.64
Frog         67.7±2.6             74.43 ± 0.92  74.82 ± 1.77
Horse        67.3±0.9             71.39 ± 0.22  62.08 ± 2.03
Ship         75.9±1.2             80.01 ± 1.69  80.04 ± 1.71
Truck        73.1±1.2             76.21 ± 0.67  70.80 ± 2.73
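The DROCC–Rand variant in Table 8 replaces the gradient-ascent search with random draws from N_i(r). A possible sketch is shown below. The source does not say how the random point inside N_i(r) is drawn, so the choice made here (uniform direction, radius uniform between r and γr, Euclidean distance) is an assumption, and `sample_random_negative` is a hypothetical name.

```python
import math
import random


def sample_random_negative(x, r, gamma, rng):
    """Draw a point whose Euclidean distance from x lies in [r, gamma * r]."""
    while True:
        g = [rng.gauss(0.0, 1.0) for _ in x]
        norm = math.sqrt(sum(c * c for c in g))
        if norm > 0.0:
            break
    radius = rng.uniform(r, gamma * r)  # gamma = 1 puts the point exactly at distance r
    return [xi + radius * gi / norm for xi, gi in zip(x, g)]
```

In high dimensions almost all such random directions point away from the data manifold, which is consistent with the explanation in Appendix D.2 for why random sampling yields weaker negatives than the adversarial search.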

the training doesn’t converge since there is no separating boundary). Assuming neural networks are implicitly regularized to find the simplest boundary, DROCC with neural networks also learns essentially a minimum enclosing ball in this case, however, at a slightly larger radius. Therefore, we get 100% AUC only at radius 1.6 rather than $1 + \epsilon$ for some very small $\epsilon$.

C. LFOC Supplementary Experiments
In Section 5.2.1, we compared DROCC–LF with various baselines for the OCLN task where the goal is to learn a classifier that is accurate for both the positive class and the arbitrary OOD negatives. Figure 9 compares the recall obtained by different methods on 2 keywords “Forward”

[Figure 7 panels, AUC (y-axis) vs. Radius (x-axis): (a) CIFAR-10 Airplane, (b) CIFAR-10 Deer, (c) CIFAR-10 Truck]

Figure 7. Ablation Study: Variation in the performance of DROCC when r (with γ = 1) is changed from the optimal value.

[Figure 8 panels, AUC (y-axis) vs. µ (x-axis): (a) CIFAR-10 Airplane, (b) CIFAR-10 Deer, (c) CIFAR-10 Truck]

Figure 8. Ablation Study: Variation in the performance of DROCC with µ (Equation 1), which is the weightage given to the loss from adversarially sampled negative points.
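To make the roles of r and µ in Figures 7 and 8 concrete, the sketch below runs the adversarial search (normalized gradient ascent followed by a projection) and forms the µ-weighted loss for a toy model. Everything specific here is an assumption for illustration: the model is a linear scorer f(x) = w·x + b rather than a neural network, the loss is the logistic loss with labels ±1, γ is fixed to 1 so that the projection is onto the Euclidean sphere of radius r, and the regularization term λ‖θ‖² is omitted.

```python
import math
import random


def logistic_loss(score, label):
    """log(1 + exp(-label * score)) for label in {+1, -1}, computed stably."""
    margin = -label * score
    return max(margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def score(w, b, x):
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def adversarial_search(w, b, x, r, eta, m, rng):
    """Find h with ||h|| = r that increases the loss of labelling x + h as negative."""
    h = [rng.gauss(0.0, 1.0) for _ in x]
    for _ in range(m):
        s = score(w, b, [xi + hi for xi, hi in zip(x, h)])
        sig = 1.0 / (1.0 + math.exp(-s))
        grad = [sig * wi for wi in w]  # gradient of log(1 + exp(f(x + h))) w.r.t. h
        g_norm = math.sqrt(sum(g * g for g in grad))
        if g_norm == 0.0:
            break
        h = [hi + eta * g / g_norm for hi, g in zip(h, grad)]  # normalized ascent step
        h_norm = math.sqrt(sum(hi * hi for hi in h))
        h = [r * hi / h_norm for hi in h]  # projection for gamma = 1
    return h


def drocc_loss(w, b, positives, r, mu, eta=0.5, m=100, seed=0):
    """Sum over the batch of loss(x, +1) + mu * loss(x + h, -1)."""
    rng = random.Random(seed)
    total = 0.0
    for x in positives:
        h = adversarial_search(w, b, x, r, eta, m, rng)
        x_adv = [xi + hi for xi, hi in zip(x, h)]
        total += logistic_loss(score(w, b, x), +1)
        total += mu * logistic_loss(score(w, b, x_adv), -1)
    return total
```

In this toy setting a larger r moves the adversarial negatives further from each positive point, and a larger µ increases how strongly the classifier is penalized for scoring those negatives as normal, which are the two trade-offs the ablation figures explore.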

and “Follow” with 2 different FPR. Table 9 lists the close negatives which were synthesized for each of the keywords.

Table 9. Synthesized near-negatives for keywords in Audio Commands
Marvin   Forward  Seven   Follow
mar      for      one     fall
marlin   fervor   eleven  fellow
arvin    ward     heaven  low
marvik   reward   when    hollow
arvi     onward   devon   wallow

Table 10. Hyperparameters: Tabular Experiments
Dataset     Radius  µ    Optimizer  Learning Rate  Adversarial Ascent Step Size
Abalone     3       1.0  Adam       10^-3          0.01
Arrhythmia  16      1.0  Adam       10^-4          0.01
Thyroid     2.5     1.0  Adam       10^-3          0.01

Table 11. Hyperparameters: CIFAR-10
Class       Radius  µ    Optimizer  Learning Rate  Adversarial Ascent Step Size
Airplane    8       1    Adam       0.001          0.001
Automobile  8       0.5  SGD        0.001          0.001
Bird        40      0.5  Adam       0.001          0.001
Cat         28      1    SGD        0.001          0.001
Deer        32      1    SGD        0.001          0.001
Dog         24      0.5  SGD        0.01           0.001
Frog        36      1    SGD        0.001          0.01
Horse       32      0.5  SGD        0.001          0.001
Ship        28      0.5  SGD        0.001          0.001
Truck       16      0.5  SGD        0.001          0.001

D. Ablation Study
D.1. Hyper-Parameters
Here we analyze the effect of two important hyper-parameters: the radius r of the ball outside which we sample negative points (set N_i(r)), and µ, which is the weightage given to the loss from adversarially generated negative points (See Equation 1). We set γ = 1 and hence recall that

the negative points are sampled to be at a distance of r from the positive points.
Figure 7a, 7b and 7c show the performance of DROCC with varied values of r on the CIFAR-10 dataset. The graphs demonstrate that sampling negative points quite far from the manifold (setting r to be very large) causes a drop in the accuracy, since now DROCC would be covering the normal data manifold loosely, causing high false positives. At the other extreme, if the radius is set too small, the decision boundary could be too close to the positives and hence lead to overfitting and difficulty in training the neural network. Hence, setting an appropriate radius value is very critical for the good performance of DROCC.
Figure 8a, 8b and 8c show the effect of µ on the performance of DROCC on CIFAR-10.

D.2. Importance of gradient ascent-descent technique
In Section 3 we formulated DROCC's optimization objective as a saddle point problem (Equation 1). We adopted the standard gradient descent-ascent technique to solve the problem, replacing the $\ell_p$ ball with N_i(r). Here, we present an analysis of DROCC without the gradient ascent part, i.e., we now sample points at random in the set of negatives N_i(r). We call this formulation DROCC–Rand. Table 8 shows the drop in performance when negative points are sampled randomly on CIFAR-10, hence emphasizing the importance of the gradient ascent-descent technique. Since N_i(r) is high dimensional, random sampling does not find points close enough to the manifold of positive points.

E. Experiment details and Hyper-Parameters for Reproducibility
E.1. Tabular Datasets
Following previous work, we use a base network consisting of a single fully-connected layer with 128 units for the deep learning baselines. For the classical algorithms, the features are input to the model. Table 10 lists all the hyper-parameters for reproducibility.

E.2. CIFAR-10
DeepSVDD uses the representations learnt in the penultimate layer of LeNet (LeCun et al., 1998) for minimizing their one-class objective. To make a fair comparison, we use the same base architecture. However, since DROCC formulates the problem as a binary classification task, we add a final fully connected layer over the learned representations to get the binary classification scores. Table 11 lists the hyper-parameters which were used to run the experiments on the standard test split of CIFAR-10.

E.3. ImageNet-10
MobileNetv2 (Sandler et al., 2018a) was used as the base architecture for DeepSVDD and DROCC. Again we use the representations from the penultimate layer of MobileNetv2 for optimizing the one-class objective of DeepSVDD. The width multiplier for MobileNetv2 was set to be 1.0. Table 12 lists all the hyper-parameters.

Table 12. Hyperparameters: ImageNet
Class             Radius  µ  Optimizer  Learning Rate  Adversarial Ascent Step Size
Tench             30      1  SGD        0.01           0.001
English springer  16      1  SGD        0.001          0.001
Cassette player   40      1  Adam       0.005          0.001
Chain saw         20      1  SGD        0.01           0.001
Church            40      1  Adam       0.01           0.001
French horn       20      1  SGD        0.05           0.001
Garbage truck     30      1  Adam       0.005          0.001
Gas pump          30      1  Adam       0.01           0.001
Golf ball         30      1  SGD        0.01           0.001
Parachute         12      1  Adam       0.001          0.001

Table 13. Hyperparameters: Timeseries Experiments
Dataset         Radius  µ    Optimizer  Learning Rate  Adversarial Ascent Step Size
Epilepsy        10      0.5  Adam       10^-5          0.1
Audio Commands  16      1.0  Adam       10^-3          0.1

Table 14. Hyperparameters: LFOC Experiments
Keyword  Radius  µ  Optimizer  Learning Rate  Adversarial Ascent Step Size
Marvin   32      1  Adam       0.001          0.01
Seven    36      1  Adam       0.001          0.01
Forward  40      1  Adam       0.001          0.01
Follow   20      1  Adam       0.0001         0.01

Figure 9. OCLN on Audio Commands: Comparison of Recall for keywords Forward and Follow when the False Positive Rate (FPR) is fixed to be 3% and 5%.

E.4. Time Series Datasets


To keep the focus only on comparing DROCC against the
baseline formulations for OOD detection, we use a
single-layer LSTM for all the experiments on Epileptic Seizure
Detection and the Audio Commands dataset. The hidden
state from the last time step is used for optimizing the
one-class objective of DeepSVDD. For DROCC we add a fully
connected layer over the last hidden state to get the binary
classification scores. Table 13 lists all the hyper-parameters
for reproducibility.
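
As a rough illustration of the architecture described above, the following pure-Python sketch runs a single-layer LSTM over a sequence, takes the hidden state from the last time step, and applies a fully connected layer to obtain a classification score. The class name `TinyLSTMClassifier`, the random weight initialization, and the dimensions are assumptions made for this example and are not taken from the experiments; a real implementation would use a deep learning framework and trained weights.

```python
import math
import random


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class TinyLSTMClassifier:
    """Hypothetical single-layer LSTM with a fully connected head (illustration only)."""

    GATES = ("i", "f", "o", "g")  # input, forget, output, candidate

    def __init__(self, input_dim: int, hidden_dim: int, seed: int = 0):
        rng = random.Random(seed)
        self.hidden_dim = hidden_dim
        # One weight matrix per gate, acting on the concatenation [x_t, h_{t-1}].
        self.weights = {
            gate: [[rng.uniform(-0.1, 0.1) for _ in range(input_dim + hidden_dim)]
                   for _ in range(hidden_dim)]
            for gate in self.GATES
        }
        self.biases = {gate: [0.0] * hidden_dim for gate in self.GATES}
        # Fully connected layer mapping the last hidden state to one score.
        self.fc_weights = [rng.uniform(-0.1, 0.1) for _ in range(hidden_dim)]
        self.fc_bias = 0.0

    def last_hidden_state(self, sequence):
        """Run the LSTM recurrence and return the hidden state of the final time step."""
        h = [0.0] * self.hidden_dim
        c = [0.0] * self.hidden_dim
        for x_t in sequence:
            z = list(x_t) + h
            pre = {
                gate: [a + b for a, b in zip(matvec(self.weights[gate], z), self.biases[gate])]
                for gate in self.GATES
            }
            i = [sigmoid(v) for v in pre["i"]]
            f = [sigmoid(v) for v in pre["f"]]
            o = [sigmoid(v) for v in pre["o"]]
            g = [math.tanh(v) for v in pre["g"]]
            c = [f_k * c_k + i_k * g_k for f_k, c_k, i_k, g_k in zip(f, c, i, g)]
            h = [o_k * math.tanh(c_k) for o_k, c_k in zip(o, c)]
        return h

    def score(self, sequence) -> float:
        """Binary classification score (a logit) computed from the last hidden state."""
        h = self.last_hidden_state(sequence)
        return sum(w * x for w, x in zip(self.fc_weights, h)) + self.fc_bias


if __name__ == "__main__":
    model = TinyLSTMClassifier(input_dim=3, hidden_dim=4)
    toy_sequence = [[0.1, 0.2, 0.3]] * 5  # five time steps, three features each
    print(model.score(toy_sequence))
```

In this sketch the same last hidden state could serve either as the representation in a one-class objective or as the input to the fully connected head, which mirrors the distinction drawn above between DeepSVDD and DROCC.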

E.5. LFOC Experiments on Audio Commands


For the Low-FPR classification task, we use keywords from
the Audio Commands dataset along with some synthesized
near-negatives. The training set consists of 1000 examples
of the keyword and 2000 randomly sampled examples from
the remaining classes in the dataset. The validation and
test sets consist of 600 examples of the keyword, the same
number of words from other classes of the Audio Commands
dataset, and an extra 600 synthesized examples of close
negatives of the keyword (see Table 9). A single-layer LSTM,
along with a fully connected layer on top of the hidden state
at the last time step, was used. Similar to the experiments with
DeepSVDD, DeepSAD uses the hidden state of the final
time step as the representation in the one-class objective. An
important aspect of training DeepSAD is the pretraining of
the network as the encoder in an autoencoder. We also tuned
this pretraining to ensure the best results.
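
Figure 9 reports recall at a fixed false positive rate, so a minimal sketch of that metric may be helpful here. The function name `recall_at_fpr`, the convention that a higher score means "more likely the keyword", and the toy scores are assumptions for illustration; this is not the evaluation code used in the experiments.

```python
import math


def recall_at_fpr(pos_scores, neg_scores, max_fpr):
    """Return (recall, threshold) for the loosest threshold whose FPR is <= max_fpr.

    An example is predicted positive when its score is strictly greater than
    the threshold. Assumes higher scores indicate the positive class.
    """
    if not pos_scores or not neg_scores:
        raise ValueError("both score lists must be non-empty")
    if not 0.0 <= max_fpr <= 1.0:
        raise ValueError("max_fpr must lie in [0, 1]")

    # Number of negatives we may misclassify while staying within the FPR budget.
    # The small epsilon guards against floating-point error in the product.
    allowed = math.floor(max_fpr * len(neg_scores) + 1e-9)

    neg_sorted = sorted(neg_scores, reverse=True)
    if allowed >= len(neg_sorted):
        threshold = float("-inf")  # every negative may be a false positive
    else:
        # At most `allowed` negatives score strictly above this value.
        threshold = neg_sorted[allowed]

    recall = sum(score > threshold for score in pos_scores) / len(pos_scores)
    return recall, threshold


if __name__ == "__main__":
    # Toy scores, invented for this example.
    keyword_scores = [0.9, 0.8, 0.4, 0.2]
    other_scores = [0.95, 0.5, 0.3, 0.1, 0.05, 0.04, 0.03, 0.02, 0.01, 0.0]
    for fpr in (0.0, 0.1, 0.2):
        print(fpr, recall_at_fpr(keyword_scores, other_scores, fpr))
```

Using a strict inequality against the score of the first disallowed negative keeps the false positive count within budget even when several negatives share the same score.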
