2020 BO and Adversarial Attacks
Amartya Sanyal∗1,2, Puneet K. Dokania†3,4, Varun Kanade‡1,2, and Philip H.S. Torr§3
1 Department of Computer Science, University of Oxford
2 The Alan Turing Institute
3 Department of Engineering Science, University of Oxford
4 Five AI Ltd., UK
Abstract
We investigate two causes for adversarial vulnerability in deep neural networks: bad data and (poorly)
trained models. When trained with SGD, deep neural networks essentially achieve zero training error, even
in the presence of label noise, while also exhibiting good generalization on natural test data, something
referred to as benign overfitting [2, 10]. However, these models are vulnerable to adversarial attacks.
We identify label noise as one of the causes for adversarial vulnerability, and provide theoretical and
empirical evidence in support of this. Surprisingly, we find several instances of label noise in datasets
such as MNIST and CIFAR, and that robustly trained models incur training error on some of these, i.e.
they don’t fit the noise. However, removing noisy labels alone does not suffice to achieve adversarial
robustness. Standard training procedures bias neural networks towards learning “simple” classification
boundaries, which may be less robust than more complex ones. We observe that adversarial training does
produce more complex decision boundaries. We conjecture that in part the need for complex decision
boundaries arises from sub-optimal representation learning. By means of simple toy examples, we show
theoretically how the choice of representation can drastically affect adversarial robustness.
1 Introduction
Modern machine learning methods achieve very high accuracy on a wide range of tasks, e.g. in computer
vision, natural language processing, etc. [28, 17, 20, 60, 55, 45], but especially in vision tasks, they have
been shown to be highly vulnerable to small adversarial perturbations that are imperceptible to the human
eye [12, 7, 51, 16, 8, 42, 38]. This vulnerability poses serious security concerns when these models are deployed
in real-world tasks (cf. [30, 29, 43, 49, 24, 32]). A large body of research has been devoted to crafting defences
to protect neural networks from adversarial attacks (e.g. [16, 41, 11, 57, 22, 9, 53, 36, 63]). However, such
defences have usually been broken by future attacks [1, 52]. This arms race between attacks and defences
suggests that to create a truly robust model would require a deeper understanding of the source of this
vulnerability.
Our goal in this paper is not to propose new defences, but to provide better answers to the question: what
causes adversarial vulnerability? In doing so, we also seek to understand how existing methods designed to
achieve adversarial robustness overcome some of the hurdles pointed out by our work. We identify two sources
of vulnerability that, to the best of our knowledge, have not been properly studied before: a) memorization
of label noise, and b) the implicit bias in the decision boundaries of neural networks trained with stochastic
gradient descent (SGD).
∗ [email protected]
† [email protected]
‡ [email protected]
§ [email protected]
Figure 1: Label noise in CIFAR10 and MNIST. Text above each image indicates the training set label.
First, in the case of label noise, starting with the celebrated work of Zhang et al. [62] it has been observed
that neural networks trained with SGD are capable of memorizing large amounts of label noise. Recent
theoretical work (e.g. [34, 4, 3, 19, 5, 6, 2, 39, 10]) has also sought to explain why fitting training data perfectly
(also referred to as memorization or interpolation) does not lead to a large drop in test accuracy, as the classical
notion of overfitting might suggest. We show through simple theoretical models, as well as experiments,
that there are scenarios where label noise does cause significant adversarial vulnerability, even when high
natural (test) accuracy can be achieved. Surprisingly, we find that label noise is not at all uncommon in
datasets such as MNIST and CIFAR-10 (see Figure 1). Our experiments show that robust training methods
like Adversarial training (AT) [36] and TRADES [63] produce models that incur training error on at least
some of the noisy examples,1 but also on atypical examples from the classes. Viewed differently, robust
training methods are unable to differentiate between atypical correctly labelled examples (rare dog) and
a mislabelled example (cat labelled as dog) and end up not memorizing either; interestingly, this failure to
memorize atypical examples has been pointed out as an explanation for slight drops in test accuracy, as the
test set often contains similarly atypical (or even identical) examples [14, 61].
Second, the fact that adversarial learning may require more “complex” decision boundaries, and as a
result may require more data, has been pointed out in prior work [48, 59, 40, 36]. However, the question
of decision boundaries in neural networks is subtle as the network learns a feature representation as well as a
decision boundary on top of it. We develop theoretical examples that establish that choosing one feature
representation over another may lead to visually more complex decision boundaries on the input space,
though these are not necessarily more complex in terms of statistical learning theoretic concepts such as VC
dimension. One way to evaluate whether more meaningful representations lead to better robust accuracy is
to use training data with more fine-grained labels (e.g. subclasses of a class); for example, one would expect
that if different breeds of dogs are labelled differently the network will learn features that are relevant to that
extra information. We show both using synthetic data and CIFAR100 that training on fine-grained labels
does increase robust accuracy.
Tsipras et al. [54] and Zhang et al. [63] have argued that the trade-off between robustness and accuracy
might be unavoidable. However, their setting involves a distribution that is not robustly separable by any
classifier. In such a situation there is indeed a trade-off between robustness and accuracy. In this paper, we
focus on settings where robust classifiers exist, which is a more realistic scenario for real-world data. At least
for vision, one may well argue that “humans” are robust classifiers, and as a result we would expect that
classes are well-separated at least in some representation space. In fact, Yang et al. [58] show that classes
are already well-separated in the input space. In such situations, there is no need for robustness to be at
odds with accuracy. A more plausible scenario which we posit, and provide theoretical examples in support
of, is that the trained models may not be using the “right” representations. Recent empirical work has also
established that modifying the training objective to favour certain properties in the learned representations
can automatically lead to improved robustness [46].
1 We manually inspected all training set errors of these models.
Summary of Theoretical Contributions
1. We provide simple sufficient conditions on the data distribution under which any classifier that fits the
training data with label noise perfectly is adversarially vulnerable.
2. The choice of the representation (and hence the shape of the decision boundary) can be important for
adversarial accuracy even when it doesn’t affect natural test accuracy.
3. There exist data distributions and training algorithms which, when trained with (some fraction of)
random label noise, have the following property: (i) using one representation, it is possible to have high
natural and robust test accuracy, but at the cost of incurring training error; (ii) using another representation,
it is possible to have no training error (including fitting the noise) and high test accuracy, but low robust
accuracy. Furthermore, any classifier that has no training error must have low robust accuracy.
The last example shows that the choice of representation matters significantly when it comes to adversarial
accuracy, and that memorizing label noise directly leads to loss of robust accuracy. The proofs of the results
are not technically complicated and are included in the supplementary material. We have focused on making
conceptually clear statements rather than optimizing the parameters to obtain the best possible bounds. We also
perform experiments on synthetic data (motivated by the theory), as well as MNIST, CIFAR10/100 to test
these hypotheses.
Summary of Experimental Contributions
1. As predicted theoretically, neural nets trained to convergence with label noise have greater adversarial
vulnerability.
2. Robust training methods such as AT and TRADES, which have higher robust accuracy, avoid overfitting
(some of the) label noise. This behaviour is also partly responsible for their decrease in natural test accuracy.
3. Even in the absence of any label noise, methods like AT and TRADES have higher robust accuracy due
to more complex decision boundaries.
4. Training with more fine-grained labels (i.e. subclasses within each class) leads to higher robust accuracy.
2 Theoretical Setting
We develop a simple theoretical framework to demonstrate how overfitting even a very small amount of label
noise causes significant adversarial vulnerability. We also show how the choice of representation can significantly
affect robust accuracy. Although we state the results for binary classification, they can easily be generalized
to multi-class problems. We formally define the notions of natural (test) error and adversarial error.
Definition 1 (Natural and Adversarial Error). For any distribution D defined over (x, y) ∈ Rd × {0, 1} and
any binary classifier f : Rd → {0, 1},
• the natural (test) error is
$$R(f; \mathcal{D}) = \mathbb{P}_{(x,y)\sim\mathcal{D}}\left[f(x) \neq y\right];$$
• if Bγ(x) is a ball of radius γ ≥ 0 around x under some norm2, the γ-adversarial error is
$$R_{Adv,\gamma}(f; \mathcal{D}) = \mathbb{P}_{(x,y)\sim\mathcal{D}}\left[\exists\, z \in B_\gamma(x) \text{ s.t. } f(z) \neq y\right].$$
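The following minimal sketch shows how both quantities can be estimated empirically for a generic classifier; it is purely illustrative (the classifier f, the data arrays, and the random-search attack inside the ball are stand-ins, not the PGD-based evaluation used in Section 3).

```python
# Empirical estimates of Definition 1 for a vectorized binary classifier f: R^d -> {0, 1}.
# The adversarial error is approximated by random search inside the l_inf ball B_gamma(x);
# a gradient-based attack would give a tighter estimate.
import numpy as np

def natural_error(f, X, y):
    """Empirical estimate of R(f; D) on samples (X, y)."""
    return np.mean(f(X) != y)

def adversarial_error(f, X, y, gamma, n_trials=100, rng=None):
    """Crude Monte-Carlo estimate of R_Adv,gamma(f; D): a point counts as vulnerable
    if any sampled z in B_gamma(x) is misclassified."""
    rng = rng or np.random.default_rng(0)
    vulnerable = f(X) != y                      # misclassified points are trivially vulnerable
    for _ in range(n_trials):
        delta = rng.uniform(-gamma, gamma, size=X.shape)
        vulnerable |= f(X + delta) != y
    return np.mean(vulnerable)
```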
In the rest of the section, we provide theoretical results to show the effect of overfitting label noise and
choice of representations (and hence simplicity of decision boundaries) on the robustness of classifiers.
2 Throughout, we will mostly use the (most commonly used) ℓ∞ norm, but the results hold for other norms.
2.1 Overfitting Label Noise
The following result provides a sufficient condition under which even a small amount of label noise causes
any classifier that fits the training data perfectly to have significant adversarial error. Informally, Theorem 1
states that if the data distribution has significant probability mass in a union of (a relatively small number
of, and possibly overlapping) balls, each of which has roughly the same probability mass (cf. Eq. (3)), then
even a small amount of label noise renders this entire region vulnerable to adversarial attacks for any classifier
that fits the training data perfectly.
Theorem 1. Let c be the target classifier, and let D be a distribution over (x, y), such that y = c (x) in its
support. Using the notation PD [A] to denote P(x,y)∼D [x ∈ A] for any measurable subset A ⊆ Rd , suppose that
there exist c1 ≥ c2 > 0, ρ > 0, and a finite set ζ ⊂ Rd satisfying
$$\mathbb{P}_{\mathcal{D}}\left[\bigcup_{s \in \zeta} B_\rho^p(s)\right] \ge c_1 \qquad \text{and} \qquad \forall s \in \zeta,\; \mathbb{P}_{\mathcal{D}}\left[B_\rho^p(s)\right] \ge \frac{c_2}{|\zeta|} \tag{3}$$
where Bρp(s) represents an ℓp-ball of radius ρ around s. Further, suppose that each of these balls contains points
from a single class, i.e. for all s ∈ ζ, for all x, z ∈ Bρp(s): c(x) = c(z).
Let Sm be a dataset of m i.i.d. samples drawn from D, which subsequently has each label flipped
independently with probability η. For any classifier f that perfectly fits the training data Sm, i.e. ∀ (x, y) ∈ Sm, f(x) = y,
for all δ > 0 and $m \ge \frac{|\zeta|}{\eta c_2}\log\frac{|\zeta|}{\delta}$, with probability at least 1 − δ, $R_{Adv,2\rho}(f; \mathcal{D}) \ge c_1$.
The goal is to find a relatively small set ζ that satisfies the condition as this will mean that even for
modest sample sizes, the trained models have significant adversarial error. We remark that it is easy to
construct concrete instantiations of problems that satisfy the conditions of the theorem, e.g. each class
represented by a spherical (truncated) Gaussian with radius ρ, with the classes being well-separated satisfies
Eq. (3). The main idea of the proof is that there is sufficient probability mass for points which are within
distance 2ρ of a training datum that was mislabelled. We note that the generality of the result, namely
that any classifier (including neural networks) that fits the training data must be vulnerable irrespective
of its structure, requires a result like Theorem 1. For instance, one could construct the classifier h where
h(x) = c(x) if (x, b) ∉ Sm for b ∈ {0, 1}, and h(x) = y if (x, y) ∈ Sm. Note that the classifier h agrees with the
target c on every point of Rd except the mislabelled training examples, and as a result these examples are
the only source of vulnerability. The complete proof is presented in Appendix A.1.
There are a few things to note about Theorem 1. First, the lower bound on adversarial error applies to
any classifier f that fits the training data Sm perfectly and is agnostic to the type of model f is. Second,
for a given c1, there may be multiple sets ζ that satisfy the bounds in (3), and the adversarial risk bound holds
for all of them. Thus, the smaller the value of |ζ|, the smaller the amount of training data that needs to be fit,
and this fitting can be done by simpler classifiers. Third, if the distribution of the data is concentrated around
a few points, then for fixed c1, c2, a smaller value of ρ suffices to satisfy (3), and thus a weaker
adversary (with smaller perturbation budget 2ρ) can cause a much larger adversarial error.
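The mechanism behind Theorem 1 can be illustrated with a toy simulation under assumptions of our own choosing (two classes supported on small ℓ∞ balls, and a 1-nearest-neighbour classifier standing in for an arbitrary interpolating model); this is not one of the experiments reported in Section 3.

```python
# Toy illustration of Theorem 1: well-separated classes on small l_inf balls, a fraction
# eta of training labels flipped, and an interpolating 1-NN classifier.  Test points within
# distance 2*rho of a mislabelled training point can be pushed onto it and are misclassified.
import numpy as np

rng = np.random.default_rng(0)
rho, eta, m = 0.1, 0.05, 2000

def sample(n):
    labels = rng.integers(0, 2, size=n)                  # class 0 near (0,0), class 1 near (3,3)
    X = 3.0 * labels[:, None] + rng.uniform(-rho, rho, size=(n, 2))
    return X, labels

Xtr, ytr = sample(m)
noisy = rng.random(m) < eta
ytr_noisy = np.where(noisy, 1 - ytr, ytr)                # flip eta fraction of the labels

def one_nn(X):                                           # interpolates the (noisy) training set
    d = np.abs(X[:, None, :] - Xtr[None, :, :]).max(-1)  # l_inf distances
    return ytr_noisy[d.argmin(1)]

Xte, yte = sample(1000)
clean_err = np.mean(one_nn(Xte) != yte)
# attack: move each test point onto its nearest mislabelled training point if within 2*rho
bad = Xtr[noisy]
d_bad = np.abs(Xte[:, None, :] - bad[None, :, :]).max(-1)
reachable = d_bad.min(1) <= 2 * rho
Xadv = np.where(reachable[:, None], bad[d_bad.argmin(1)], Xte)
adv_err = np.mean(one_nn(Xadv) != yte)
print(f"natural error {clean_err:.3f}, adversarial error (budget 2*rho) {adv_err:.3f}")
```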
In practice, classifiers exhibit much greater vulnerability than what arises purely from the presence of memorized
noisy data. Experiments in Section 3.1 show how label noise causes vulnerability in a toy MNIST model, as
well as on the full MNIST dataset.
(a) Both the parity and the union-of-intervals classifiers predict red if a point lies inside any green interval and blue if it lies outside all intervals. The ×-es are correctly labelled and the ◦-es are mis-labelled points. Reference integer points on the line are labelled in binary. (b) Robust generalization needs more complex boundaries.
Figure 2: Visualization of the distribution and classifiers used in the proofs of Theorems 2 and 3. Red and
blue indicate the two classes.
For instance, fully-connected networks achieve the same test accuracy even when the pixels of every
training image in the training set are permuted with a fixed permutation [62]. This invariance is worrying, as
it means that such a network can effectively classify a matrix (or tensor) that is visually nothing like a real
image into an image category. While CNNs do not have this particular invariance, as Liu et al. [35] show,
location invariance in CNNs means that they are unable to predict where in the image a particular object is.
In particular, it may be that the decision boundary for robust classifiers needs to be “visually” more
complex as pointed out in prior work [40], but we emphasize that this may be because of the choice of
representation, and in particular in standard measures of statistical complexity, such as VC dimension, this
may not be the case. We demonstrate this phenomenon by a simple (artificial) example even when there
is no label noise. Our example in Section 2.3 combines the two causes and shows how classifiers that are
translation invariant may be worse for adversarial robustness.
Theorem 2. For some universal constant c, and any 0 < γ0 < 1/√2, there exists a family of distributions
D defined on X × {0, 1}, where X ⊆ R2, such that for all distributions P ∈ D, and denoting by Sm =
{(x1, y1), · · · , (xm, ym)} a sample of size m drawn i.i.d. from P,
(i) For any m ≥ 0, Sm is linearly separable, i.e. there exist w ∈ R2, w0 ∈ R such that yi(w⊤xi + w0) ≥ 0 for all (xi, yi) ∈ Sm.
Furthermore, for every γ > γ0, any linear separator f that perfectly fits the training data Sm has
RAdv,γ(f; P) ≥ 0.0005, even though R(f; P) → 0 as m → ∞.
(ii) There exists a function class H such that for some m ∈ O(log(δ−1)), any h ∈ H that perfectly fits Sm
satisfies, with probability at least 1 − δ, R(h; P) = 0 and RAdv,γ(h; P) = 0, for any γ ∈ [0, γ0 + 1/8].
A complete proof of this result appears in Appendix A.2, but first, we provide a sketch of the key idea
here. The distributions in the family D are supported on balls of radius at most 1/√2 centred on the integer lattice
in R2. The true class label for any point x is provided by the parity of a + b, where (a, b) is the lattice
point closest to x. However, the distributions in D are chosen to be such that there is also a linear classifier
that can separate these classes, e.g. a distribution only supported on balls centered at the points (a, a) and
(a + 1, a) for some integer a (see Figure 2b). Visually, learning the classification problem using the parity of
a + b results in a seemingly more complex decision boundary, a point that has been made earlier regarding
the need for more complex boundaries to achieve adversarial robustness [40, 13]. However, it is worth noting
that this complexity is not rooted in any statistical theory; e.g. the VC dimension of the classes considered
in Theorem 2 is essentially the same (it is even lower for H by 1). This visual complexity arises purely due to
the fact that the linear classifier looks at a geometric representation of the data whereas the parity classifier
looks at the binary representation of the sum of the nearest integers to the coordinates. In the case of neural
networks, recent work [26] has indeed provided empirical results to support that excessive invariance (e.g.
rotation invariance) increases adversarial error.
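The construction can also be instantiated numerically; the sketch below is our own rendering of the (r, k)-1-bit-parity model from the proof (Appendix A.2), comparing the linear separator of Lemma 1 with the parity classifier g1 of Lemma 2 in terms of worst-case margin.

```python
# Instantiation of the (r, k)-1-bit-parity model: data lies in l_2 balls of radius r around
# lattice points (i + y, i), the label y is the parity of the coordinate sum of the nearest
# lattice point, and the same data is also linearly separable.
import numpy as np

rng = np.random.default_rng(1)
r, k, m = 0.25, 5, 4000

y = rng.integers(0, 2, size=m)
i = rng.integers(1, k + 1, size=m)
theta = rng.uniform(0, 2 * np.pi, size=m)
rad = r * np.sqrt(rng.random(m))                 # uniform over the l_2 ball of radius r
X = np.stack([i + y + rad * np.cos(theta), i + rad * np.sin(theta)], axis=1)

def linear_clf(Z):                               # w = (1, -1), w0 = -0.5, as in Lemma 1
    return (Z[:, 0] - Z[:, 1] - 0.5 > 0).astype(int)

def parity_clf(Z):                               # g_1 from Lemma 2: parity of the nearest lattice point
    return (np.rint(Z[:, 0]) + np.rint(Z[:, 1])).astype(int) % 2

for name, f in [("linear", linear_clf), ("parity", parity_clf)]:
    err = np.mean(f(X) != y)
    if name == "linear":                         # distance to the separating hyperplane
        margin = np.abs(X[:, 0] - X[:, 1] - 0.5) / np.sqrt(2)
    else:                                        # distance to the nearest half-integer grid line
        margin = np.abs((X - 0.5) - np.rint(X - 0.5)).min(axis=1)
    print(f"{name}: error {err:.3f}, smallest margin {margin.min():.3f}")
# with r closer to 1/(2*sqrt(2)) the linear margin shrinks towards 0, while the parity margin stays ~0.5 - r
```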
Figure 3: Adversarial error increases with label noise (η) if training error is 0. Shaded region shows 95%
confidence interval.
Using the “correct” representation, it may not be possible to fit the noisy training data
perfectly, but the classifier that best fits the training data,3 will have good test accuracy and adversarial
accuracy. However, using an “incorrect” representation, we show that it is possible to find a classifier that
has no training error, has good test accuracy, but has high adversarial error. We posit this as a (partial)
explanation of why classifiers trained on real data (with label noise, or at least atypical examples) have good
test accuracy, while still being vulnerable to adversarial attacks.
Theorem 3 (Formal version). For any n ∈ Z+, there exists a family of distributions Dn over
R × {0, 1} and function classes C, H, such that for any P from Dn, any 0 < γ < 1/4, and η ∈ (0, 1/2),
if Sm = {(xi, yi)}_{i=1}^{m} denotes a sample of size m where
$$m = O\!\left(\max\left\{\, n \log\!\left(\frac{n}{\delta}\right)\left(\frac{1-\eta}{(1-2\eta)^2} + 1\right),\; \frac{n}{\eta\gamma^2}\log\frac{n}{\gamma\delta}\, \right\}\right)$$
drawn from P, and if Sm,η denotes the sample where each label is flipped independently with probability η, then
(i) the classifier c ∈ C that minimizes the training error on Sm,η has R(c; P) = 0 and RAdv,γ(c; P) = 0 for
0 ≤ γ < 1/4;
(ii) there exists h ∈ H such that h has zero training error on Sm,η and R(h; P) = 0. However, for any γ > 0, and for
any h ∈ H with zero training error on Sm,η, RAdv,γ(h; P) ≥ 0.1.
Furthermore, the required c ∈ C and h ∈ H above can be computed in $O\!\left(\mathrm{poly}(n),\, \mathrm{poly}\!\left(\frac{1}{1-2\eta}\right),\, \mathrm{poly}\!\left(\frac{1}{\delta}\right)\right)$
time.
We sketch the proof here and present the complete proof in Appendix B; as in Section 2.2, we will make
use of parity functions, though the key point is the representations used. Let X = [0, N], where N = 2^n. We
consider distributions that are supported on intervals (i − 1/4, i + 1/4) for i ∈ {1, . . . , N − 1} (see Figure 2a),
but any such distribution will only have a small number, O(n), of intervals on which it is supported. The
true class label is given by a function that depends on the parity of some hidden subsets S of bits in the
bit-representation of the closest integer i, e.g. as in Figure 2a if S = {0, 2}, then only the least significant
and the third least significant bit of i are examined and the class label is 1 if an odd number of them are 1
and 0 otherwise. Despite the noise, the correct label on any interval can be guessed by using the majority
vote and as a result, the correct parity learnt using Gaussian elimination. (This corresponds to the class
C in Theorem 3.) On the other hand it is also possible to learn the function as a union of intervals, i.e.
find intervals, I1 , I2 , . . . , Ik such that any point that lies in one of these intervals is given the label 1 and
any other point is given the label 0. By choosing intervals carefully, it is possible to fit all the training data,
including noisy examples, but yet not compromise on test accuracy (Fig. 2a). Such a classifier, however, will
be vulnerable to adversarial examples by applying Theorem 1. A classifier such as union of intervals (H in
3 This is referred to as the Empirical Risk Minimization (ERM) in the statistical learning theory literature.
6
(a) ε = 0.01 (b) ε = 0.025 (c) ε = 0.05 (d) ε = 0.1 (e) ε = 0.2
Figure 4: Adversarial error on the full MNIST dataset for varying levels of adversarial perturbation.
There is negligible variance between runs and thus the shaded region showing the confidence interval is
invisible.
Theorem 3) is translation-invariant, whereas the parity classifier is not. This suggests that using classifiers,
such as neural networks, that are designed to have too many built-in invariances might hurt their robust
accuracy.
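The following sketch illustrates, under illustrative parameter choices of our own, how a union-of-intervals classifier can interpolate the noisy labels while remaining accurate on clean test points, and why the carved-out regions around memorized noisy points are adversarially vulnerable; it is a simplified stand-in for the algorithm analysed in Appendix B, not that algorithm itself.

```python
# Interpolating noisy labels with a union of intervals: start from the per-interval
# majority label and carve out a tiny interval around every mislabelled point.
# Each carve-out is an adversarially vulnerable region around a memorized noisy example.
import numpy as np

rng = np.random.default_rng(2)
support = np.array([1, 2, 5, 6, 9, 11])                         # integers indexing the support intervals
true_label = {j: bin(j).count("1") % 2 for j in support}        # parity of all bits, for illustration

m, eta, gamma = 3000, 0.1, 0.05
j = rng.choice(support, size=m)
x = j + rng.uniform(-0.25, 0.25, size=m)
y = np.array([true_label[t] for t in j])
y_noisy = np.where(rng.random(m) < eta, 1 - y, y)

def interval_classifier(q):
    """Per-interval label (stand-in for the majority vote, which equals the true label w.h.p.),
    overridden within gamma/2 of any memorized mislabelled training point."""
    out = np.array([true_label[t] for t in np.rint(q).astype(int)])
    for i in np.flatnonzero(y_noisy != y):                      # memorize each noisy point
        out[np.abs(q - x[i]) < gamma / 2] = y_noisy[i]
    return out

q = j + rng.uniform(-0.25, 0.25, size=m)                        # fresh test points with clean labels
q_true = np.array([true_label[t] for t in j])
print("test error:", np.mean(interval_classifier(q) != q_true))
# attack: move each test point onto the nearest memorized noisy point if within gamma
bad = x[y_noisy != y]
d = np.abs(q[:, None] - bad[None, :])
q_adv = np.where(d.min(1) <= gamma, bad[d.argmin(1)], q)
print("adversarial error (budget gamma):", np.mean(interval_classifier(q_adv) != q_true))
```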
3 Experimental results
In Section 2, we provided three theoretical settings to highlight how fitting label noise and sub-optimal
representation learning (leading to seemingly simpler decision boundaries) hurt adversarial robustness. In
this section, we provide supporting empirical evidence on synthetic data inspired by the theory and on the standard
datasets MNIST [31], CIFAR10, and CIFAR100 [27].
Figure 5: Two dimensional PCA projections of the original correctly labelled (blue and orange), original
mis-labelled (green and red), and adversarial examples (purple and brown) at different stages of training.
The correct label for True 0 (blue), Noisy 0 (green), and Adv 0 (purple +) is the same, i.e. 0, and similarly for the
other class.
To inject label noise, for a randomly chosen fraction η of the training examples
we assigned the class label randomly. The network is optimized with SGD with a batch size of 128 and a learning
rate of 0.1 for 60 epochs; the learning rate is decreased to 0.01 after 50 epochs.
We compute the natural test accuracy and the adversarial test accuracy when the network is attacked
with an ℓ∞-bounded PGD adversary for varying perturbation budget ε, with a step size of 0.01 and 20
steps. Figure 4 shows that the effect of over-fitting label noise is even more clearly visible here: for the same
PGD adversary, the adversarial error jumps sharply with increasing label noise, while the growth of natural
test error is much slower.
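For concreteness, a sketch of the ℓ∞ PGD evaluation attack (20 steps, step size 0.01) is given below; it assumes a standard PyTorch classifier `model` with inputs in [0, 1] and is a generic implementation rather than a verbatim copy of our evaluation code.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps, step_size=0.01, steps=20):
    """l_inf PGD: maximize the cross-entropy loss within the eps-ball around x."""
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1).detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            x_adv = x_adv + step_size * grad.sign()                # ascend the loss
            x_adv = torch.min(torch.max(x_adv, x - eps), x + eps)  # project onto the l_inf ball
            x_adv = x_adv.clamp(0, 1)
    return x_adv.detach()

# adversarial test error on one batch:
# adv_err = (model(pgd_attack(model, x, y, eps)).argmax(1) != y).float().mean()
```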
Visualizing through low-dimensional projections: For the toy-MNIST problem, we plot a 2-d
projection (using PCA) of the learned representations (activations before the last layer) at various stages
of training in Figure 5. (We remark that the simplicity of the data model ensures that even a 1-d PCA
projection suffices to perfectly separate the classes when there is no label noise; however, the representations
learned by a neural network in the presence of noise may be very different!) We highlight two key observations:
(i) The bulk of adversarial examples (“+”-es) are concentrated around the mis-labelled training data (“◦”-es)
of the opposite class. For example, the purple +-es (Adversarially perturbed: True: 0, Pred:1 ) are very close
to the green ◦-es (Mislabelled: True:0, Pred: 1). This provides empirical validation for the hypothesis that if
there is a mis-labelled data-point in the vicinity that has been fit by the model, an adversarial example can
be created by moving towards that data point as predicted by Theorem 1. (ii) The mis-labelled training
data take longer to be fit by the classifier. For example, by iteration 20, the network has already learned a fairly
good representation and classification boundary that correctly fits the clean training data (but not the noisy
training data). At this stage, the number of adversarial examples is much lower than at iteration
160, by which point the network has completely fit the noisy training data. Thus early stopping helps in
avoiding memorization of the label noise, and consequently also reduces adversarial vulnerability. Early stopping
has indeed been used as a defence in quite a few recent papers in the context of adversarial robustness [56, 23],
as well as for learning in the presence of label noise [33]. Our work provides an explanation of why early
stopping may reduce adversarial vulnerability: it avoids fitting noisy training data.
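A sketch of how such a projection can be produced is shown below; `model.features` is a hypothetical handle for the sub-network up to the penultimate layer and would need to be replaced by the appropriate module or a forward hook for a given architecture.

```python
import numpy as np
import torch
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

@torch.no_grad()
def penultimate(model, x):
    # activations before the final classifier; `model.features` is a placeholder name
    return model.features(x).flatten(1).cpu().numpy()

def plot_pca(model, x_clean, x_noisy, x_adv):
    feats = [penultimate(model, x) for x in (x_clean, x_noisy, x_adv)]
    pca = PCA(n_components=2).fit(np.concatenate(feats))
    styles = [("clean train", "."), ("mislabelled train", "o"), ("adversarial", "+")]
    for f, (label, marker) in zip(feats, styles):
        z = pca.transform(f)
        plt.scatter(z[:, 0], z[:, 1], label=label, marker=marker, alpha=0.6)
    plt.legend()
    plt.show()
```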
ε      Train-Acc. (%)   Test-Acc. (%)
0.0    99.98            95.25
0.25   97.23            92.77
1.0    86.03            81.62
Table 1: Train and test accuracies on the clean dataset for ResNet-50 models trained using ℓ2 adversaries of
perturbation ε. The ε = 0 setting represents natural training.
Figure 6: Each pair is a training (left) and test (right) image, from CIFAR10 and MNIST, mis-classified by the
adversarially trained model. Both were correctly classified by the naturally trained model.
Figure 7: Fraction of train points that have a self-influence greater than s, plotted versus s, for the classes
PLANES, CAR, BIRD, CAT, and DEER. The blue line represents the points mis-classified by an adversarially
trained model on CIFAR10. The orange line shows the distribution of self-influence for all points in the
CIFAR10 dataset (of the concerned class).
Experiments on MNIST and CIFAR10 We demonstrate this effect in Figure 6 with examples from
CIFAR10 and MNIST. Each pair of images contains a test image mis-classified by a robustly trained model
and the mis-classified training image “responsible” for it (we describe below how they were identified).
Importantly both of these images were correctly classified by a naturally trained model. Visually, it is evident
that the training images are extremely similar to the corresponding test images. Inspecting the rest of the
training set, they are also very different from the other images it contains. We can thus refer to these as
rare sub-populations.
The notion that a certain test example was not classified correctly because a particular training example
was not classified correctly is measured by the influence the training image has on the test image (cf. Definition
3 in Zhang and Feldman [61]). Intuitively, it compares the probability that a certain test example is
classified correctly when the model is learned using a training set that contains that particular training point
with the probability when the training set does not contain it. We obtained the influence of
each training image on each test image for that class from Zhang and Feldman [61]. We found the images
in Figure 6 by manually searching, for each mis-classified test image, for the training image that is misclassified
and visually close to it. Our search space was shortened with the help of the influence score each training image
has on the classification of a test image: we searched in the set of the top-10 most influential mis-classified
training images for each mis-classified test image. The model used for Figure 6 is an AT model for CIFAR10
with an ℓ2-adversary with ε = 0.25 and a model trained with TRADES for MNIST with λ = 16 and ε = 0.3.
A precise way of measuring whether a sample is rare is through the concept of self-influence. The self-influence of
an example with respect to an algorithm (model, optimizer, etc.) can be defined as how unlikely it is for the
model learnt by that algorithm to be correct on the example if it had not seen that example during training,
compared to if it had seen the example during training. For a precise mathematical definition please refer to
Eq. (1) in Zhang and Feldman [61]. Self-influence for a rare example, one that is unlike other examples of its
class, will be high, as the rest of the dataset does not provide relevant information that would help the model
predict correctly on that particular example. In Figure 7, we show that the self-influence of training
samples that were mis-classified by adversarially trained models but correctly classified by a naturally trained
model is high compared to the distribution of self-influence over the entire training set. In other words,
the self-influence of the training examples mis-classified by the robustly trained models is larger
than the average self-influence of (all) examples belonging to that class. This supports our hypothesis that
adversarial training avoids fitting these rare samples (the ones that need to be memorized).
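For completeness, the following is a hedged sketch of a subsampled estimator in the spirit of Zhang and Feldman [61] (we used their precomputed scores rather than this code); `train_model` is a hypothetical helper returning a fitted classifier with a `predict` method, and the constants are illustrative.

```python
# Self-influence of (x_i, y_i): gap between the model's accuracy on example i when i is
# included in vs. excluded from a random training subset, averaged over many subsets.
import numpy as np

def self_influence(train_set, train_model, n_models=100, subset_frac=0.7, rng=None):
    rng = rng or np.random.default_rng(0)
    n = len(train_set)
    correct = np.zeros((n_models, n), dtype=bool)   # correct[t, i]: model t is right on example i
    included = np.zeros((n_models, n), dtype=bool)  # included[t, i]: example i was in model t's subset
    for t in range(n_models):
        idx = rng.choice(n, size=int(subset_frac * n), replace=False)
        included[t, idx] = True
        model = train_model([train_set[i] for i in idx])
        correct[t] = np.array([model.predict(x) == y for (x, y) in train_set])
    acc_in = (correct & included).sum(0) / np.maximum(included.sum(0), 1)
    acc_out = (correct & ~included).sum(0) / np.maximum((~included).sum(0), 1)
    return acc_in - acc_out                          # high value: rare / memorized example
```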
(a) Shallow-Wide NN (b) Deep NN
Figure 8: Adversarial training (AT) leads to a larger margin, and thus adversarial robustness, around high
density regions (larger circles) but causes training error on low density sub-populations (smaller circles),
whereas naturally trained models (NAT) minimize the training error but produce regions with very small
margins.
We train a shallow, wide network with 1000 neurons in each layer (Shallow-Wide NN) and a deep network with 4 layers and 100 neurons
in each layer using the cross-entropy loss and SGD. The background colour shows the decision region of the learnt
neural network. Figure 8 shows that the adversarially trained (AT) models ignore the smaller circles (i.e. rare
sub-populations) and try to obtain a larger margin around the circles they do classify correctly, whereas the
naturally trained (NAT) models correctly predict every circle but end up with a very small margin around a
lot of circles.
11
(a) Shallow NN (b) Shallow-Wide NN (c) Deep NN (d) Large Margin
Figure 9: Decision boundaries of neural networks are much simpler than they should be.
One way to encourage the network to learn more meaningful representations is to train it to distinguish the sub-populations within the
same class. We test this hypothesis with two experiments. First, we test it on the distribution defined
in Theorem 2: for each ball with label 1, we assign a distinct label (say α1, · · · , αk) and, similarly,
for each ball with label 0, we assign a distinct label (β1, · · · , βk). We then solve a multi-class classification
problem over the 2k classes with a deep neural network and later aggregate the results by reporting all αi's as
1 and all βi's as 0 (a sketch of this relabelling and aggregation is given below). The resulting decision boundary
is drawn in Figure 10a along with the decision boundary for natural training and AT. Clearly, the decision
boundary for AT is the most complex and has the largest margin (and robustness), followed by the multi-class
model and then the naturally trained model.
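The relabelling and aggregation step can be sketched as follows; the network architecture and the names used here are illustrative, not the exact model used for Figure 10a.

```python
import torch.nn as nn

k = 5                                         # sub-populations (balls) per binary class
model = nn.Sequential(nn.Linear(2, 100), nn.ReLU(),
                      nn.Linear(100, 100), nn.ReLU(),
                      nn.Linear(100, 2 * k))  # one output per fine label

def fine_label(ball_index, binary_label):
    # beta_i (binary label 0) -> ball_index, alpha_i (binary label 1) -> k + ball_index
    return ball_index + k * binary_label

def predict_binary(x):
    logits = model(x)                          # shape (batch, 2k)
    return (logits.argmax(dim=1) >= k).long()  # aggregate: fine classes k..2k-1 report label 1
```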
Second, we also repeat the experiment with CIFAR-100. We train a ResNet50 [21] on the fine labels of
CIFAR100 and then aggregate the fine labels corresponding to a coarse label by summing up the logits. We
call this model the Fine2Coarse model and compare the adversarial risk of this network to a ResNet-50
trained directly on the coarse labels. Note that the model is end-to-end differentiable as the only addition
is a layer to aggregate the logits corresponding to the fine classes pertaining to each coarse class. Thus
PGD adversarial attacks can be applied out of the box. Figure 10b shows that for all perturbation budgets,
Fine2Coarse has smaller adversarial risk than the naturally trained model.
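A sketch of the aggregation layer is given below; `fine_to_coarse` denotes the standard CIFAR-100 fine-to-coarse index mapping (assumed available as a length-100 list), and the module simply sums the fine-class logits per coarse class so that gradients flow through unchanged.

```python
import torch
import torch.nn as nn

class Fine2Coarse(nn.Module):
    def __init__(self, fine_model, fine_to_coarse, n_coarse=20):
        super().__init__()
        self.fine_model = fine_model
        # 0/1 aggregation matrix of shape (100, 20): sums logits of the fine classes per coarse class
        agg = torch.zeros(len(fine_to_coarse), n_coarse)
        agg[torch.arange(len(fine_to_coarse)), torch.tensor(fine_to_coarse)] = 1.0
        self.register_buffer("agg", agg)

    def forward(self, x):
        return self.fine_model(x) @ self.agg   # (batch, 100) -> (batch, 20), end-to-end differentiable
```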
4 Related Work
[37] established that there are concept classes with finite VC dimension, i.e. that are properly PAC-learnable, but
that are only improperly robustly PAC-learnable. This implies that to learn the problem with small adversarial
error, a different class of models (or representations) needs to be used, whereas for small natural test risk,
the original model class (or representation) suffices. Recent empirical works have also shown evidence
towards this (e.g. [46]).
Hanin and Rolnick [18] have shown that though the number of possible linear regions that can be created
by a deep ReLU network is exponential in depth, in practice, for networks trained with SGD, this number tends to
grow only linearly, thus creating much simpler decision boundaries than is possible given the sheer expressivity
of deep networks. Experiments on the data models from our theoretical settings show that adversarial
training indeed produces more “complex” decision boundaries.
Jacobsen et al. [25] have argued that excessive invariance in neural networks might increase adversarial
error. However, their argument is that excessive invariance can allow sufficient changes in the semantically
important features without changing the network's prediction. They describe these as invariance-based
adversarial examples, as opposed to perturbation-based adversarial examples. We show that excessive
(incorrect) invariance might also result in perturbation-based adversarial examples.
Another contemporary work [15] discusses a phenomenon its authors refer to as Shortcut Learning, where deep
learning models perform very well on standard tasks, like reducing classification error, but fail in
more difficult real-world situations. We discuss this in the context of models that have small test error but
large adversarial error, and provide theoretical and empirical evidence that one of the reasons for this is
sub-optimal representation learning.
(a) Decision regions of neural networks are more complex for adversarially trained models. Treating the problem as a multi-class classification problem with natural training (MULTICLASS) also increases robustness by increasing the margin. (b) Adversarial error on coarse labels of CIFAR-100.
Figure 10: Assigning a separate class to each sub-population within the original class during training increases
robustness by learning more meaningful representations.
5 Conclusion
Recent research has largely shone a positive light on interpolation (zero training error) by highly over-
parameterized models even in the presence of label noise. While overfitting noisy data may not harm
generalisation, we have shown that this can be severely detrimental to robustness. This raises a new security
threat where label noise can be inserted into datasets to make the models learnt from them vulnerable to
adversarial attacks without hurting their test accuracy. As a result, further research into learning without
memorization is ever more important [47, 50]. Further, we underscore the importance of proper representation
learning with regard to adversarial robustness. Representations learnt by deep networks often encode a lot
of different invariances, e.g., location, permutation, rotation, etc. While some of them are useful for the
particular task at hand, we highlight that certain invariances can increase adversarial vulnerability. Thus we
believe that making significant progress towards training robust models with good test error requires us to
rethink representation learning and closely examine the data on which we are training these models.
6 Acknowledgement
We thank Vitaly Feldman and Chiyuan Zhang for providing us with data that helped to significantly speed
up some parts of this work. We also thank Nicholas Lord for feedback on the draft. AS acknowledges support
from The Alan Turing Institute under the Turing Doctoral Studentship grant TU/C/000023. VK is supported
in part by the Alan Turing Institute under the EPSRC grant EP/N510129/1. PHS and PD are supported by
the ERC grant ERC-2012-AdG 321162-HELIOS, EPSRC grant Seebibyte EP/M013774/1 and EPSRC/MURI
grant EP/N019474/1. PHS and PD also acknowledge the Royal Academy of Engineering and FiveAI.
A Proofs for Section 2
In this section, we present the formal proofs of the theorems stated in Section 2.
Theorem 1 (restated). Let c be the target classifier, and let D be a distribution over (x, y) such that y = c(x) on its support.
Using the notation PD[A] to denote P(x,y)∼D[x ∈ A] for any measurable subset A ⊆ Rd, suppose that there exist
c1 ≥ c2 > 0, ρ > 0, and a finite set ζ ⊂ Rd satisfying
$$\mathbb{P}_{\mathcal{D}}\left[\bigcup_{s \in \zeta} B_\rho^p(s)\right] \ge c_1 \qquad \text{and} \qquad \forall s \in \zeta,\; \mathbb{P}_{\mathcal{D}}\left[B_\rho^p(s)\right] \ge \frac{c_2}{|\zeta|} \tag{4}$$
where Bρp(s) represents an ℓp-ball of radius ρ around s. Further, suppose that each of these balls contains points
from a single class, i.e. for all s ∈ ζ, for all x, z ∈ Bρp(s): c(x) = c(z).
Let Sm be a dataset of m i.i.d. samples drawn from D, which subsequently has each label flipped
independently with probability η. For any classifier f that perfectly fits the training data Sm, i.e. ∀ (x, y) ∈ Sm, f(x) = y,
for all δ > 0 and $m \ge \frac{|\zeta|}{\eta c_2}\log\frac{|\zeta|}{\delta}$, with probability at least 1 − δ, $R_{Adv,2\rho}(f; \mathcal{D}) \ge c_1$.
Proof of Theorem 1. Substituting $m \ge \frac{|\zeta|}{\eta c_2}\log\frac{|\zeta|}{\delta}$ and applying the union bound over all s ∈ ζ, we get
$$R_{Adv,2\rho}(f; \mathcal{D}) \ge \mathbb{P}_{(x,y)\sim\mathcal{D}}\left[\exists\, z \in B_{2\rho}(x) : f(z) \neq c(x)\right] \ge c_1 \quad \text{w.p. } 1 - \delta,$$
where c is the true concept for the distribution D. The second equality follows from the assumption that
each of the balls around s ∈ ζ is pure in its labels. The second-to-last equality follows from (4), by using the
point x that is guaranteed to exist in the ball around s and to be mis-labelled with probability at least 1 − δ. The last
equality follows from Assumption (4).
Theorem 2 (restated). For some universal constant c, and any 0 < γ0 < 1/√2, there exists a family of distributions
D defined on X × {0, 1}, where X ⊆ R2, such that for all distributions P ∈ D, and denoting by Sm = {(x1, y1), · · · , (xm, ym)}
a sample of size m drawn i.i.d. from P,
(i) For any m ≥ 0, Sm is linearly separable, i.e. there exist w ∈ R2, w0 ∈ R such that yi(w⊤xi + w0) ≥ 0 for all (xi, yi) ∈ Sm.
Furthermore, for every γ > γ0, any linear separator f that perfectly fits the training data Sm has
RAdv,γ(f; P) ≥ 0.0005, even though R(f; P) → 0 as m → ∞.
(ii) There exists a function class H such that for some m ∈ O(log(δ−1)), any h ∈ H that perfectly fits Sm
satisfies, with probability at least 1 − δ, R(h; P) = 0 and RAdv,γ(h; P) = 0, for any γ ∈ [0, γ0 + 1/8].
Proof of Theorem 2. We define a family of distributions D such that each distribution in D is supported on
balls of radius r around (i, i) and (i + 1, i) for positive integers i. Either all the balls around (i, i) have
label 1 and the balls around (i + 1, i) have label 0, or vice versa. Figure 2b shows an example where the
colours indicate the labels.
Formally, for r > 0 and k ∈ Z+, the (r, k)-1-bit-parity class conditional model is defined over (x, y) ∈ R2 × {0, 1}
as follows. First, a label y is sampled uniformly from {0, 1}, then an integer i is sampled uniformly from the
set {1, · · · , k}, and finally x is generated by sampling uniformly from the ℓ2 ball of radius r around (i + y, i).
In Lemma 1 we first show that a set of m points sampled i.i.d. from any distribution as defined above with
r < 1/(2√2) is, with probability 1, linearly separable for any m. In addition, standard VC bounds show that any
linear classifier that separates Sm, for large enough m, will have small test error. Lemma 1 also proves that
there exists a range of γ, r such that for any distribution defined with r in that range, though it is possible to
obtain a linear classifier with 0 training and test error, the minimum adversarial risk is bounded away from 0.
Thus, while it is possible to obtain a linear classifier with 0 test error, all such linear classifiers have
a large adversarial vulnerability. In Lemma 2, we show that there exists a different representation for this
problem which also achieves zero training and test error and, in addition, has zero adversarial risk for a range
of r, γ where the linear classifier's adversarial error is at least a constant.
Lemma 1 (Linear Classifier). There exist universal constants γ0, ρ such that for any perturbation γ > γ0,
radius r ≥ ρ, and k ∈ Z+, the following holds. Let D be the family of (r, k)-1-bit-parity class conditional
models, P ∈ D, and Sn = {(x1, y1), · · · , (xn, yn)} be a set of n points sampled i.i.d. from P.
1) For any n > 0, Sn is linearly separable with probability 1, i.e. there exists an h : (w, w0), w ∈ R2, w0 ∈ R,
such that the linear hyperplane x ↦ w⊤x + w0 separates Sn with probability 1.
2) Further, there exists a universal constant c such that for any ε, δ > 0, with probability 1 − δ, for any Sn
with $n = c\,\frac{1}{\varepsilon^2}\log\frac{1}{\delta}$, any linear classifier h̃ that separates Sn has R(h̃; P) ≤ ε.
3) Let h : (w, w0) be any linear classifier that has R(h; P) = 0. Then RAdv,γ(h; P) > 0.0005.
We will prove the first part for any r < 1/(2√2) by constructing w, w0 that satisfy the constraints
of linear separability. Let w = (1, −1), w0 = −0.5. Consider any point (x, y) ∈ Sn and let z = 2y − 1.
Converting to the polar coordinate system, there exist θ ∈ [0, 2π] and j ∈ {1, · · · , k} such that
$$x = \left(j + \tfrac{z+1}{2} + r\cos\theta,\; j + r\sin\theta\right).$$
Then
$$\begin{aligned}
z\left(w^\top x + w_0\right) &= z\left(j + \tfrac{z+1}{2} + r\cos\theta - j - r\sin\theta - 0.5\right) && w = (1,-1)\\
&= z\left(\tfrac{z}{2} + 0.5 + r\cos\theta - r\sin\theta - 0.5\right)\\
&= \tfrac{1}{2} + zr\left(\cos\theta - \sin\theta\right) && |\cos\theta - \sin\theta| < \sqrt{2},\; z \in \{-1, 1\}\\
&> \tfrac{1}{2} - r\sqrt{2}\\
&> 0 && r < \tfrac{1}{2\sqrt{2}}.
\end{aligned}$$
Part 2 follows from standard VC bounds for linear classifiers.
Let the universal constants γ0, ρ be 0.02 and 1/(2√2) − 0.008 respectively. Note that there is nothing special
about these constants except that some constant is required to bound the adversarial risk away from 0. Now,
consider a distribution P from the 1-bit-parity model such that the radius of each ball is at least ρ. This is smaller
than 1/(2√2) and thus satisfies the linear separability criterion.
Consider h to be a hyperplane that has 0 test error. Let the ℓ2 radius of adversarial perturbation be
γ > γ0. The region of each ball that is vulnerable to the attack is a circular segment with the
chord of the segment parallel to the hyperplane. Let the minimum height over all such circular segments be r0.
Thus, RAdv,γ(h; P) is at least the mass of a circular segment of height r0. Let the radius of each ball
in the support of P be r.
Using the fact that h has zero test error (and thus classifies the balls in the support of P correctly) and
simple geometry,
$$\frac{1}{\sqrt{2}} \ge r + (\gamma - r_0) + r, \qquad\text{i.e.}\qquad r_0 \ge 2r + \gamma - \frac{1}{\sqrt{2}}. \tag{5}$$
To compute RAdv,γ(h; P) we need the ratio of the area of a circular segment of height r0 in a circle of radius r
to the area of the circle. This ratio can be written as
$$A\!\left(\frac{r_0}{r}\right) = \frac{\cos^{-1}\!\left(1 - \frac{r_0}{r}\right) - \left(1 - \frac{r_0}{r}\right)\sqrt{2\frac{r_0}{r} - \frac{r_0^2}{r^2}}}{\pi}. \tag{6}$$
As (6) is increasing in r0/r, we can evaluate
$$\begin{aligned}
\frac{r_0}{r} &\ge \frac{2r - \frac{1}{\sqrt{2}} + \gamma}{r} && \text{using (5)}\\
&\ge 2 - \frac{\frac{1}{\sqrt{2}} - 0.02}{r} && \gamma > \gamma_0 = 0.02\\
&\ge 2 - \frac{\frac{1}{\sqrt{2}} - 0.02}{\frac{1}{2\sqrt{2}} - 0.008} > 0.01 && r > \rho = \frac{1}{2\sqrt{2}} - 0.008.
\end{aligned}$$
Substituting r0/r > 0.01 into (6) gives an area fraction of more than 0.0005, and hence RAdv,γ(h; P) > 0.0005.
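The final numerical step can be checked directly; the snippet below evaluates the segment-area fraction of Eq. (6) at r0/r = 0.01.

```python
# Numerical check of the last step of Lemma 1: the area fraction of a circular segment with
# height ratio r0/r = 0.01 already exceeds 0.0005.
import numpy as np

def segment_fraction(t):          # t = r0 / r, Eq. (6)
    return (np.arccos(1 - t) - (1 - t) * np.sqrt(2 * t - t ** 2)) / np.pi

print(segment_fraction(0.01))     # ~0.0006 > 0.0005
```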
Lemma 2 (Robustness of parity classifier). There exists a concept class H such that for any γ ∈ (γ0, γ0 + 1/8],
k ∈ Z+, and P the corresponding (ρ, k)-1-bit-parity class distribution, where ρ, γ0 are the same as in Lemma 1,
there exists g ∈ H such that
$$R(g; \mathcal{P}) = 0 \qquad R_{Adv,\gamma}(g; \mathcal{P}) = 0.$$
Proof of Lemma 2. We will again provide a proof by construction. Consider the following class of concepts
H such that gb ∈ H is defined as
$$g_b(x_1, x_2) = \begin{cases} 1 & \text{if } [x_1] + [x_2] = b \pmod 2\\ 1 - b & \text{o.w.} \end{cases} \tag{7}$$
where [x] rounds x to the nearest integer and b ∈ {0, 1}. In Figure 2b, the green staircase-like classifier belongs
to this class. Consider the classifier g1. Note that by construction R(g1; P) = 0. The decision boundaries of g1
that are closest to a ball in the support of P centred at (a, b) are the lines x = a ± 0.5 and y = b ± 0.5.
As γ ≤ γ0 + 1/8, the adversarial perturbation is upper bounded by 1/50 + 1/8. The radius of the ball is upper
bounded by 1/(2√2), and, as we noted, the centre of the ball is at a distance of 0.5 from the decision boundary.
If the sum of the maximum adversarial perturbation and the maximum radius of the ball is less than the
minimum distance of the centre of the ball from the decision boundary, then the adversarial error is 0.
Substituting the values,
$$\frac{1}{50} + \frac{1}{8} + \frac{1}{2\sqrt{2}} < 0.499 < \frac{1}{2}.$$
This completes the proof.
Theorem 3 (restated). Let the setting be as in Theorem 3: Sm is a sample of size m (as specified in the theorem statement)
drawn from P, and Sm,η denotes the sample where each label is flipped independently with probability η. Then:
(i) the classifier c ∈ C that minimizes the training error on Sm,η has R(c; P) = 0 and RAdv,γ(c; P) = 0 for
0 ≤ γ < 1/4;
(ii) there exists h ∈ H such that h has zero training error on Sm,η and R(h; P) = 0. However, for any γ > 0, and for
any h ∈ H with zero training error on Sm,η, RAdv,γ(h; P) ≥ 0.1.
Furthermore, the required c ∈ C and h ∈ H above can be computed in $O\!\left(\mathrm{poly}(n),\, \mathrm{poly}\!\left(\frac{1}{1-2\eta}\right),\, \mathrm{poly}\!\left(\frac{1}{\delta}\right)\right)$
time.
Proof of Theorem 3. We provide a constructive proof: we construct a family of distributions, two concept
classes C and H, and the ERM algorithms that learn the corresponding concepts, and then use Lemmas 3
and 4 to complete the proof.
Distribution: Consider the family of distributions Dn such that DS,ζ ∈ Dn is defined on Xζ × {0, 1} for
S ⊆ {1, · · · , n}, ζ ⊆ {1, · · · , 2^n − 1}, where the support of Xζ is a union of intervals:
$$\mathrm{supp}(\mathcal{X}_\zeta) = \bigcup_{j \in \zeta} I_j \qquad\text{where}\qquad I_j := \left(j - \frac{1}{4},\, j + \frac{1}{4}\right). \tag{8}$$
We consider distributions with a relatively small support, i.e. where |ζ| = O(n). Each sample (x, y) ∼ DS,ζ
is created by sampling x uniformly from Xζ and assigning y = cS(x), where cS ∈ C is defined below in (9). We
define the family of distributions D = ∪n∈Z+ Dn. Finally, we create D^η_{S,ζ}, a noisy version of DS,ζ, by flipping
y in each sample (x, y) with probability η < 1/2. Samples from DS,ζ can be obtained using the example oracle
EX(DS,ζ), and samples from the noisy distribution can be obtained through the noisy oracle EXη(DS,ζ).
Concept Class C: We define the concept class Cn of concepts cS : [0, 2^n] → {0, 1} such that
$$c_S(x) = \begin{cases} 1 & \text{if } \left(\langle [x] \rangle_b \;\mathrm{XOR}\; S\right) \text{ is odd}\\ 0 & \text{o.w.} \end{cases} \tag{9}$$
where [·] : R → Z rounds a decimal to its nearest integer, ⟨·⟩b : {0, · · · , 2^n} → {0, 1}^n returns the binary
encoding of the integer, and $(\langle [x] \rangle_b \;\mathrm{XOR}\; S) = \sum_{j \in S} \langle [x] \rangle_b[j] \bmod 2$. Here ⟨[x]⟩b[j] is the j-th least significant bit
in the binary encoding of the nearest integer to x. It is essentially the class of parity functions defined on the
bits corresponding to the indices in S for the binary encoding of the nearest integer to x. For example, as
in Figure 2a, if S = {0, 2}, then only the least significant and the third least significant bit of i are examined,
and the class label is 1 if an odd number of them are 1 and 0 otherwise.
Concept Class H: Finally, we define the concept class H = ∪∞k=1 Hk, where Hk is the class of unions
of k intervals on the real line. Each concept hI ∈ Hk can be written as a set of k disjoint intervals
I = {I1, · · · , Ik} on the real line, i.e. for 1 ≤ j ≤ k, Ij = [a, b] where 0 ≤ a ≤ b, and
$$h_I(x) = \begin{cases} 1 & \text{if } x \in \bigcup_j I_j\\ 0 & \text{o.w.} \end{cases} \tag{10}$$
Now, we look at the algorithms to learn the concepts from C and H that minimize the train error. Both
of the algorithms will use a majority vote to determine the correct (de-noised) label for each interval, which
will be necessary to minimize the test error. The intuition is that if we draw a sufficiently large number of
samples, then the majority of samples on each interval will have the correct label
with a high probability.
Lemma 3 proves that there exists an algorithm A such that A draws $m = O\!\left(|\zeta|^2 \frac{(1-\eta)}{(1-2\eta)^2}\log\frac{|\zeta|}{\delta}\right)$ samples
from the noisy oracle EXη(DS,ζ) and, with probability 1 − δ (where the probability is over the randomization
in the oracle), returns f ∈ C such that R(f; DS,ζ) = 0 and RAdv,γ(f; DS,ζ) = 0 for all γ < 1/4. As Lemma 3
states, the algorithm involves Gaussian elimination over |ζ| variables and |ζ| majority votes (one in each
interval) involving a total of m samples. Thus the algorithm runs in O(poly(m) + poly(|ζ|)) time. Replacing
the complexity of m and using the fact that |ζ| = O(n), the complexity of the algorithm is $O\!\left(\mathrm{poly}\!\left(n, \frac{1}{1-2\eta}, \frac{1}{\delta}\right)\right)$.
Lemma 4 proves that there exists an algorithm Ã such that Ã draws
$$m > \max\left\{\, 2|\zeta|^2 \log\frac{2|\zeta|}{\delta}\left(\frac{8(1-\eta)}{(1-2\eta)^2} + 1\right),\; \frac{0.1|\zeta|}{\eta\gamma^2}\log\frac{0.1|\zeta|}{\gamma\delta}\, \right\}$$
samples and returns h ∈ H such that h has 0 training error, 0 test error, and an adversarial test error of at least
0.1. We can replace |ζ| = O(n) to get the required bound on m in the theorem. The algorithm to construct
h visits every point at most twice: once during the construction of the intervals using majority voting, and
once while accommodating the mislabelled points. Replacing the complexity of m, the complexity of the
algorithm is $O\!\left(\mathrm{poly}\!\left(n, \frac{1}{1-2\eta}, \frac{1}{\gamma}, \frac{1}{\delta}\right)\right)$. This completes the proof.
Proof of Lemma 3. The algorithm A makes m calls to the noisy oracle EXη(DS,ζ) to obtain a set of points
Sm = {(x1, y1), · · · , (xm, ym)}. It replaces each xi by [xi] ([·] rounds a decimal to the nearest integer) and then
removes duplicate xi's, preserving the most frequent label yi associated with each xi. For example, if
S5 = {(2.8, 1), (2.9, 0), (3.1, 1), (3.2, 1), (3.9, 0)}, then after this operation we will have {(3, 1), (4, 0)}.
As $m \ge 2|\zeta|^2 \log\frac{2|\zeta|}{\delta}\left(\frac{8(1-\eta)}{(1-2\eta)^2} + 1\right)$, using δ2 = δ/2 and $k = \frac{8(1-\eta)}{(1-2\eta)^2}\log\frac{2|\zeta|}{\delta}$ in Lemma 5 guarantees that,
with probability 1 − δ/2, each interval will have at least $\frac{8(1-\eta)}{(1-2\eta)^2}\log\frac{2|\zeta|}{\delta}$ samples.
Then, for any specific interval, using δ1 = δ/(2|ζ|) in Lemma 6 guarantees that, with probability at least 1 − δ/(2|ζ|),
the majority vote for the label in that interval will succeed in returning the de-noised label. Applying a union
bound over all |ζ| intervals guarantees that, with probability at least 1 − δ, the majority label of every
interval is the denoised label.
Now the problem reduces to solving a parity problem on this reduced dataset of |ζ| points (after denoising,
all points in an interval can be reduced to the integer in the interval and the denoised label). We know that
there exists a polynomial-time algorithm using Gaussian elimination that finds a consistent hypothesis for this
problem. We have already guaranteed that there is a point in Sm from every interval in the support of DS,ζ.
Further, f is consistent on Sm and f is constant on each of these intervals by design. Thus, with probability
at least 1 − δ we have that R(f; DS,ζ) = 0.
By construction, f makes a constant prediction on each interval (j − 1/2, j + 1/2) for all j ∈ ζ. Thus, for
any perturbation radius γ < 1/4, the adversarial risk RAdv,γ(f; DS,ζ) = 0. Combining everything, we have shown
that there is an algorithm that makes $2|\zeta|^2 \log\frac{2|\zeta|}{\delta}\left(\frac{8(1-\eta)}{(1-2\eta)^2} + 1\right)$ calls to the EXη(DS,ζ) oracle, and runs in time
polynomial in |ζ|, 1/(1 − 2η), 1/δ, to return f ∈ C such that R(f; DS,ζ) = 0 and RAdv,γ(f; DS,ζ) = 0 for γ < 1/4.
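A compact sketch of this de-noise-then-solve procedure is given below (our own rendering of algorithm A, with illustrative data handling): a majority vote per interval recovers the clean labels, and Gaussian elimination over GF(2) recovers the parity set S. Here xs, ys are the noisy training samples and n_bits = n.

```python
# Majority-vote de-noising per interval followed by Gaussian elimination over GF(2)
# to recover the hidden parity set S.
import numpy as np

def learn_parity(xs, ys, n_bits):
    # majority vote per interval (de-noising step)
    votes = {}
    for x, y in zip(xs, ys):
        votes.setdefault(int(round(x)), []).append(y)
    ints = sorted(votes)
    labels = [int(np.mean(votes[j]) > 0.5) for j in ints]
    # GF(2) linear system: bits(j) . S = label(j) (mod 2)
    A = np.array([[(j >> b) & 1 for b in range(n_bits)] for j in ints], dtype=np.uint8)
    b = np.array(labels, dtype=np.uint8)
    # Gaussian elimination over GF(2)
    S = np.zeros(n_bits, dtype=np.uint8)
    row, pivots = 0, []
    for col in range(n_bits):
        piv = next((r for r in range(row, len(A)) if A[r, col]), None)
        if piv is None:
            continue
        A[[row, piv]], b[[row, piv]] = A[[piv, row]], b[[piv, row]]   # swap pivot row into place
        for r in range(len(A)):
            if r != row and A[r, col]:
                A[r] ^= A[row]
                b[r] ^= b[row]
        pivots.append(col)
        row += 1
    for r, col in enumerate(pivots):
        S[col] = b[r]
    return S            # indicator vector of the recovered parity set S (free variables set to 0)
```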
Proof of Lemma 4. The first part of the algorithm works similarly to Lemma 3. The algorithm Ã makes
m calls to the noisy oracle EXη(DS,ζ) to obtain a set of points Sm = {(x1, y1), · · · , (xm, ym)}, where
$m \ge 2|\zeta|^2 \log\frac{2|\zeta|}{\delta}\left(\frac{8(1-\eta)}{(1-2\eta)^2} + 1\right)$. Ã computes h ∈ H as follows. To begin, let the list of intervals in h be
I and let Mz = {}. Then do the following for every (x, y) ∈ Sm:
1. let z := [x],
2. let Nz ⊆ Sm be the set of all (x, y) ∈ Sm such that |x − z| < 0.5,
3. compute the majority label ỹ of Nz,
4. add all (x, y) ∈ Nz such that y ≠ ỹ to Mz,
5. if ỹ = 1, then add the interval (z − 0.5, z + 0.5) to I,
6. remove all elements of Nz from Sm, i.e. Sm := Sm \ Nz.
For reasons similar to Lemma 3, as $m \ge 2|\zeta|^2 \log\frac{2|\zeta|}{\delta}\left(\frac{8(1-\eta)}{(1-2\eta)^2} + 1\right)$, Lemma 5 guarantees that, with
probability 1 − δ/2, each interval will have at least $\frac{8(1-\eta)}{(1-2\eta)^2}\log\frac{2|\zeta|}{\delta}$ samples. Then, for any specific interval,
Lemma 6 guarantees that with probability at least 1 − δ/(2|ζ|) the majority vote for the label in that interval
will succeed in returning the de-noised label. Applying a union bound over all intervals guarantees that,
with probability at least 1 − δ, the majority label of every interval is the denoised label. As each interval
in ζ has at least one point, all the intervals in ζ with label 1 will be included in I with probability 1 − δ. Thus,
R(h; DS,ζ) = 0.
Now, for all (x, y) ∈ Mz, add the degenerate interval [x, x] to I if y = 1. If y = 0, then x must lie in an interval
(a, b) ∈ I; replace that interval as follows: I := (I \ {(a, b)}) ∪ {(a, x), (x, b)}. As only a finite number of sets with
Lebesgue measure 0 were added or deleted from I, the net test error of h does not change and is still 0, i.e.
R(h; DS,ζ) = 0.
For the second part, we will invoke Theorem 1. To avoid confusion in notation, we will use Γ instead
of ζ to refer to the set in Theorem 1 and reserve ζ for the support intervals of DS,ζ. Let Γ be any set of
disjoint intervals of width γ/2 such that |Γ| = 0.1|ζ|/γ. This is always possible, as the total width of all intervals
in Γ is $\frac{0.1|\zeta|}{\gamma}\cdot\frac{\gamma}{2} = \frac{0.1|\zeta|}{2}$, which is less than the total width of the support, |ζ|/2. The constants c1, c2 from Eq. (3) are
$$c_1 = \mathbb{P}_{D_{S,\zeta}}[\Gamma] = \frac{0.1|\zeta|/2}{|\zeta|/2} = 0.1, \qquad c_2 = \frac{\gamma/2}{|\zeta|/2}\,|\zeta| = \gamma.$$
Thus, if h has an error of zero on a set of m0 examples drawn from EXη(DS,ζ), where $m_0 > \frac{0.1|\zeta|}{\eta\gamma^2}\log\frac{0.1|\zeta|}{\gamma\delta}$,
then by Theorem 1, RAdv,γ(h; DS,ζ) > 0.1.
Combining the two parts, for
$$m > \max\left\{\, 2|\zeta|^2 \log\frac{2|\zeta|}{\delta}\left(\frac{8(1-\eta)}{(1-2\eta)^2} + 1\right),\; \frac{0.1|\zeta|}{\eta\gamma^2}\log\frac{0.1|\zeta|}{\gamma\delta}\, \right\}$$
it is possible to obtain h ∈ H such that h has zero training error, R(h; DS,ζ) = 0, and RAdv,γ(h; DS,ζ) > 0.1
for any γ > 0.
Lemma 5. Given k ∈ Z+ and a distribution DS,ζ, for any δ2 > 0, if $m > 2|\zeta|^2 k + 2|\zeta|^2 \log\frac{|\zeta|}{\delta_2}$ samples
are drawn from EX(DS,ζ), then with probability at least 1 − δ2 there are at least k samples in each interval
(j − 1/4, j + 1/4) for all j ∈ ζ.
Proof of Lemma 5. We repeat the following procedure |ζ| times, once for each interval in ζ, and show that
with probability at least 1 − δ2/|ζ| the j-th run results in at least k samples in the j-th interval.
Corresponding to each interval in ζ, we sample at least m0 samples, where $m_0 = 2|\zeta| k + 2|\zeta| \log\frac{|\zeta|}{\delta_2}$.
If $z_i^j$ is the random variable that is 1 when the i-th sample belongs to the j-th interval, then the j-th interval has
fewer than k points out of the m0 points sampled for that interval with probability less than δ2/|ζ|:
$$\begin{aligned}
\mathbb{P}\left[\sum_i z_i^j \le k\right] &= \mathbb{P}\left[\sum_i z_i^j \le (1-\delta)\mu\right] && \delta = 1 - \frac{k}{\mu},\; \mu = \mathbb{E}\left[\sum_i z_i^j\right]\\
&\le \exp\left(-\left(1 - \frac{k}{\mu}\right)^2 \frac{\mu}{2}\right) && \text{by Chernoff's inequality}\\
&\le \exp\left(-\left(\frac{m_0}{2|\zeta|} - k + \frac{k^2|\zeta|}{2m_0}\right)\right) && \mu = \frac{m_0}{|\zeta|}\\
&\le \exp\left(k - \frac{m_0}{2|\zeta|}\right) \le \frac{\delta_2}{|\zeta|},
\end{aligned}$$
where the last step follows from $m_0 > 2|\zeta| k + 2|\zeta|\log\frac{|\zeta|}{\delta_2}$. Hence, with probability at least 1 − δ2/|ζ|, a given
interval has at least k samples. Finally, a union bound over the intervals gives the desired result. As we repeat
the process for all |ζ| intervals, the total number of samples drawn is at least $|\zeta| m_0 = 2|\zeta|^2 k + 2|\zeta|^2 \log\frac{|\zeta|}{\delta_2}$.
Lemma 6 (Majority Vote). For a given y ∈ {0, 1}, let S = {s1, · · · , sm} be a set of size m where each
element is y with probability 1 − η and 1 − y otherwise. If $m > \frac{8(1-\eta)}{(1-2\eta)^2}\log\frac{1}{\delta_1}$, then with probability at least
1 − δ1 the majority of S is y.
Proof of Lemma 6. Without loss of generality let y = 1. For the majority to be 1, we need to show that there
are more than m/2 “1”s in S, i.e. we need to show that the following probability is less than δ1:
$$\begin{aligned}
\mathbb{P}\left[\sum s_i < \frac{m}{2}\right] &= \mathbb{P}\left[\sum s_i < \frac{m}{2\mu}\mu + \mu - \mu\right] && \mu = \mathbb{E}\left[\sum s_i\right]\\
&= \mathbb{P}\left[\sum s_i < \left(1 - \left(1 - \frac{m}{2\mu}\right)\right)\mu\right]\\
&\le \exp\left(-\frac{(1-2\eta)^2}{8(1-\eta)^2}\,\mu\right) && \text{by Chernoff's inequality}\\
&= \exp\left(-\frac{(1-2\eta)^2}{8(1-\eta)}\,m\right) && \because \mu = (1-\eta)m\\
&\le \delta_1 && \because m > \frac{8(1-\eta)}{(1-2\eta)^2}\log\frac{1}{\delta_1}.
\end{aligned}$$
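The bound can be sanity-checked by simulation; the snippet below is an illustrative Monte-Carlo check with arbitrary choices of η and δ1.

```python
# Monte-Carlo check of Lemma 6: with m > 8(1-eta)/(1-2*eta)^2 * log(1/delta1) samples,
# the majority vote returns the true label with probability at least 1 - delta1.
import numpy as np

rng = np.random.default_rng(0)
eta, delta1 = 0.3, 0.05
m = int(np.ceil(8 * (1 - eta) / (1 - 2 * eta) ** 2 * np.log(1 / delta1))) + 1
trials = 20000
flips = rng.random((trials, m)) < eta            # 1 = element disagrees with y
failures = np.mean(flips.sum(axis=1) >= m / 2)   # majority wrong (or tied)
print(f"m = {m}, empirical failure rate {failures:.4f} <= delta1 = {delta1}")
```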
References
[1] A. Athalye, N. Carlini, and D. Wagner. Obfuscated gradients give a false sense of security: Circumventing
defenses to adversarial examples. In Proceedings of the 35th International Conference on Machine
Learning, ICML 2018, July 2018. URL https://arxiv.org/abs/1802.00420.
[2] P. L. Bartlett, P. M. Long, G. Lugosi, and A. Tsigler. Benign overfitting in linear regression. Proceedings
of the National Academy of Sciences, page 201907378, apr 2020. doi: 10.1073/pnas.1907378117.
[3] M. Belkin, D. J. Hsu, and P. Mitra. Overfitting or perfect fitting? risk bounds for classification
and regression rules that interpolate. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman,
N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Sys-
tems 31, pages 2300–2311. Curran Associates, Inc., 2018. URL http://papers.nips.cc/paper/
7498-overfitting-or-perfect-fitting-risk-bounds-for-classification-and-regression-rules-that-interp
pdf.
[4] M. Belkin, S. Ma, and S. Mandal. To understand deep learning we need to understand kernel learning.
In J. Dy and A. Krause, editors, Proceedings of the 35th International Conference on Machine Learning,
volume 80 of Proceedings of Machine Learning Research, pages 541–549, Stockholmsmässan, Stockholm
Sweden, 10–15 Jul 2018. PMLR. URL http://proceedings.mlr.press/v80/belkin18a.html.
[5] M. Belkin, D. Hsu, and J. Xu. Two models of double descent for weak features. arXiv:1903.07571, 2019.
[6] M. Belkin, A. Rakhlin, and A. B. Tsybakov. Does data interpolation contradict statistical optimality?
In K. Chaudhuri and M. Sugiyama, editors, Proceedings of Machine Learning Research, volume 89
of Proceedings of Machine Learning Research, pages 1611–1619. PMLR, 16–18 Apr 2019. URL http:
//proceedings.mlr.press/v89/belkin19a.html.
[7] B. Biggio and F. Roli. Wild patterns. In Proceedings of the 2018 ACM SIGSAC Conference on Computer
and Communications Security. ACM, jan 2018. doi: 10.1145/3243734.3264418.
[8] N. Carlini and D. Wagner. Towards evaluating the robustness of neural networks. In 2017 IEEE
Symposium on Security and Privacy (SP). IEEE, may 2017. doi: 10.1109/sp.2017.49.
[9] N. Carlini and D. Wagner. Adversarial examples are not easily detected. In Proceedings of the 10th ACM
Workshop on Artificial Intelligence and Security -. ACM Press, 2017. doi: 10.1145/3128572.3140444.
[10] N. S. Chatterji and P. M. Long. Finite-sample analysis of interpolating linear classifiers in the overpa-
rameterized regime. arXiv:2004.12019, 2020.
[11] M. Cisse, P. Bojanowski, E. Grave, Y. Dauphin, and N. Usunier. Parseval networks: Improving
robustness to adversarial examples. In D. Precup and Y. W. Teh, editors, Proceedings of the 34th
International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research,
pages 854–863, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR. URL
http://proceedings.mlr.press/v70/cisse17a.html.
[12] N. Dalvi, P. Domingos, Mausam, S. Sanghai, and D. Verma. Adversarial classification. In Proceedings of
the 2004 ACM SIGKDD international conference on Knowledge discovery and data mining - KDD2004.
ACM Press, 2004. doi: 10.1145/1014052.1014066.
[13] A. Degwekar, P. Nakkiran, and V. Vaikuntanathan. Computational limitations in robust classification
and win-win results. In A. Beygelzimer and D. Hsu, editors, Proceedings of the Thirty-Second Conference
on Learning Theory, volume 99 of Proceedings of Machine Learning Research, pages 994–1028, Phoenix,
USA, 25–28 Jun 2019. PMLR. URL http://proceedings.mlr.press/v99/degwekar19a.html.
[14] V. Feldman. Does learning require memorization? a short tale about a long tail. arXiv:1906.05271, 2019.
[15] R. Geirhos, J.-H. Jacobsen, C. Michaelis, R. Zemel, W. Brendel, M. Bethge, and F. A. Wichmann.
Shortcut learning in deep neural networks. arXiv:2004.07780, 2020.
[16] I. J. Goodfellow, J. Shlens, and C. Szegedy. Explaining and Harnessing Adversarial Examples. arXiv
preprint arXiv:1412.6572, dec 2014. URL http://arxiv.org/abs/1412.6572.
[17] A. Graves, A.-r. Mohamed, and G. Hinton. Speech recognition with deep recurrent neural networks. In
2013 IEEE international conference on acoustics, speech and signal processing, pages 6645–6649. IEEE,
2013.
[18] B. Hanin and D. Rolnick. Complexity of linear regions in deep networks. In K. Chaudhuri and
R. Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning,
volume 97 of Proceedings of Machine Learning Research, pages 2596–2604, Long Beach, California, USA,
09–15 Jun 2019. PMLR. URL http://proceedings.mlr.press/v97/hanin19a.html.
[19] T. Hastie, A. Montanari, S. Rosset, and R. J. Tibshirani. Surprises in high-dimensional ridgeless least
squares interpolation. arXiv:1903.08560, 2019.
[20] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance
on imagenet classification. In Proceedings of the IEEE international conference on computer vision,
pages 1026–1034, 2015.
[21] K. He, X. Zhang, S. Ren, and J. Sun. Deep Residual Learning for Image Recognition. In 2016 IEEE
Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778. IEEE, jun 2016.
ISBN 978-1-4673-8851-1. doi: 10.1109/CVPR.2016.90. URL http://ieeexplore.ieee.org/document/
7780459/.
[22] W. He, J. Wei, X. Chen, N. Carlini, and D. Song. Adversarial example defenses: Ensembles of weak
defenses are not strong. In Proceedings of the 11th USENIX Conference on Offensive Technologies,
WOOT17, page 15, USA, 2017. USENIX Association.
[23] D. Hendrycks, K. Lee, and M. Mazeika. Using pre-training can improve model robustness and uncertainty.
Proceedings of the International Conference on Machine Learning, 2019.
[24] D. Hendrycks, K. Zhao, S. Basart, J. Steinhardt, and D. Song. Natural adversarial examples.
arXiv:1907.07174, 2019.
[25] J.-H. Jacobsen, J. Behrmann, R. Zemel, and M. Bethge. Excessive invariance causes adversarial
vulnerability. In International Conference on Learning Representations, 2019. URL https://openreview.
net/forum?id=BkfbpsAcF7.
[26] S. Kamath, A. Deshpande, and K. V. Subrahmanyam. Invariance vs robustness of neural networks. 2020.
URL https://openreview.net/forum?id=HJxp9kBFDS.
[27] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Technical report,
Citeseer, 2009.
[28] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural
networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
[29] A. Kurakin, I. Goodfellow, and S. Bengio. Adversarial examples in the physical world. arXiv preprint
arXiv:1607.02533, 2016.
[30] A. Kurakin, I. Goodfellow, and S. Bengio. Adversarial machine learning at scale. International Conference
on Learning Representations (ICLR), 2017.
[31] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition.
Proceedings of the IEEE, 86(11):2278–2324, 1998.
[32] J. Li, F. Schmidt, and Z. Kolter. Adversarial camera stickers: A physical camera-based attack on deep
learning systems. In K. Chaudhuri and R. Salakhutdinov, editors, Proceedings of the 36th International
Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 3896–
3904, Long Beach, California, USA, 09–15 Jun 2019. PMLR. URL http://proceedings.mlr.press/
v97/li19j.html.
[33] M. Li, M. Soltanolkotabi, and S. Oymak. Gradient descent with early stopping is provably robust to
label noise for overparameterized neural networks. arXiv:1903.11680, 2019.
[34] T. Liang and A. Rakhlin. Just interpolate: Kernel ”ridgeless” regression can generalize. arXiv:1808.00387,
2018.
[35] R. Liu, J. Lehman, P. Molino, F. P. Such, E. Frank, A. Sergeev, and J. Yosinski. An intriguing failing of
convolutional neural networks and the coordconv solution. In S. Bengio, H. M. Wallach, H. Larochelle,
K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing
Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8
December 2018, Montréal, Canada, pages 9628–9639, 2018. URL http://papers.nips.cc/paper/
8169-an-intriguing-failing-of-convolutional-neural-networks-and-the-coordconv-solution.
[36] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu. Towards deep learning models resistant
to adversarial attacks. In International Conference on Learning Representations, 2018. URL https:
//openreview.net/forum?id=rJzIBfZAb.
[37] O. Montasser, S. Hanneke, and N. Srebro. VC classes are adversarially robustly learnable, but only
improperly. In A. Beygelzimer and D. Hsu, editors, Proceedings of the Thirty-Second Conference on
Learning Theory, volume 99 of Proceedings of Machine Learning Research, pages 2512–2530, Phoenix,
USA, 25–28 Jun 2019. PMLR. URL http://proceedings.mlr.press/v99/montasser19a.html.
[38] S.-M. Moosavi-Dezfooli, A. Fawzi, and P. Frossard. DeepFool: A simple and accurate method to
fool deep neural networks. In CVPR, pages 2574–2582. IEEE Computer Society, 2016.
[39] V. Muthukumar, K. Vodrahalli, V. Subramanian, and A. Sahai. Harmless interpolation of noisy
data in regression. IEEE Journal on Selected Areas in Information Theory, pages 1–1, 2020. doi:
10.1109/jsait.2020.2984716.
[40] P. Nakkiran. Adversarial robustness may be at odds with simplicity. arXiv:1901.00532,
2019.
[41] N. Papernot, P. McDaniel, X. Wu, S. Jha, and A. Swami. Distillation as a defense to adversarial
perturbations against deep neural networks. arXiv:1511.04508, 2015.
[42] N. Papernot, P. McDaniel, and I. Goodfellow. Transferability in machine learning: from phenomena to
black-box attacks using adversarial samples. arXiv:1605.07277, 2016.
[43] N. Papernot, P. McDaniel, I. Goodfellow, S. Jha, Z. B. Celik, and A. Swami. Practical black-box
attacks against machine learning. In Proceedings of the 2017 ACM on Asia Conference on Computer
and Communications Security. ACM, apr 2017. doi: 10.1145/3052973.3053009.
[44] A. Raghunathan, S. M. Xie, F. Yang, J. C. Duchi, and P. Liang. Adversarial training can hurt
generalization. arXiv:1906.06032, 2019.
[45] S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region
proposal networks. In Advances in neural information processing systems, pages 91–99, 2015.
[46] A. Sanyal, P. K. Dokania, V. Kanade, and P. Torr. Robustness via deep low-rank representations.
arXiv:1804.07090, 2020.
[47] A. Sanyal, P. H. Torr, and P. K. Dokania. Stable rank normalization for improved generalization in
neural networks and GANs. In International Conference on Learning Representations, 2020. URL
https://openreview.net/forum?id=H1enKkrFDB.
[48] L. Schmidt, S. Santurkar, D. Tsipras, K. Talwar, and A. Madry. Adversarially robust generalization
requires more data. In Advances in Neural Information Processing Systems, pages 5014–5026, 2018.
[49] L. Schönherr, K. Kohls, S. Zeiler, T. Holz, and D. Kolossa. Adversarial attacks against automatic speech
recognition systems via psychoacoustic hiding. arXiv:1808.05665, 2018.
[50] Y. Shen and S. Sanghavi. Learning with bad training data via iterative trimmed loss minimization.
In K. Chaudhuri and R. Salakhutdinov, editors, Proceedings of the 36th International Conference on
Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 5739–5748, Long Beach,
California, USA, 09–15 Jun 2019. PMLR. URL http://proceedings.mlr.press/v97/shen19e.html.
[51] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus. Intriguing
properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.
[52] F. Tramèr, N. Carlini, W. Brendel, and A. Madry. On adaptive attacks to adversarial example defenses.
arXiv:2002.08347, 2020.
[53] F. Tramèr, A. Kurakin, N. Papernot, I. Goodfellow, D. Boneh, and P. McDaniel. Ensemble adversarial
training: Attacks and defenses. In International Conference on Learning Representations, 2018. URL
https://openreview.net/forum?id=rkZvSe-RZ.
[54] D. Tsipras, S. Santurkar, L. Engstrom, A. Turner, and A. Madry. Robustness may be at odds with
accuracy. In International Conference on Learning Representations, 2019. URL https://openreview.
net/forum?id=SyxAb30cY7.
[55] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin.
Attention is all you need. In Advances in neural information processing systems, pages 5998–6008, 2017.
[56] E. Wong, L. Rice, and J. Z. Kolter. Fast is better than free: Revisiting adversarial training. In
International Conference on Learning Representations, 2020. URL https://openreview.net/forum?
id=BJx040EFvH.
[57] W. Xu, D. Evans, and Y. Qi. Feature squeezing: Detecting adversarial examples in deep neural networks.
arXiv:1704.01155, 2017. doi: 10.14722/ndss.2018.23198.
[58] Y.-Y. Yang, C. Rashtchian, H. Zhang, R. Salakhutdinov, and K. Chaudhuri. Adversarial robustness
through local lipschitzness. arXiv:2003.02460, 2020.
[59] D. Yin, R. Kannan, and P. Bartlett. Rademacher complexity for adversarially robust generalization.
In K. Chaudhuri and R. Salakhutdinov, editors, Proceedings of the 36th International Conference on
Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 7085–7094, Long Beach,
California, USA, 09–15 Jun 2019. PMLR. URL http://proceedings.mlr.press/v97/yin19b.html.
[60] S. Zagoruyko and N. Komodakis. Wide residual networks. In Proceedings of the British Machine Vision
Conference 2016. British Machine Vision Association, 2016. doi: 10.5244/c.30.87.
[61] C. Zhang and V. Feldman. What neural networks memorize and why: Discovering the long tail
via influence estimation. 2020. URL http://vtaly.net/papers/FZ_Infl_mem.pdf. Unpublished
manuscript.
[62] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals. Understanding deep learning requires
rethinking generalization. International Conference on Learning Representations (ICLR), nov 2016.
URL http://arxiv.org/abs/1611.03530.
[63] H. Zhang, Y. Yu, J. Jiao, E. P. Xing, L. E. Ghaoui, and M. I. Jordan. Theoretically principled trade-off
between robustness and accuracy. In K. Chaudhuri and R. Salakhutdinov, editors, Proceedings of the 36th
International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California,
USA, volume 97 of Proceedings of Machine Learning Research, pages 7472–7482. PMLR, 2019. URL
http://proceedings.mlr.press/v97/zhang19p.html.