Chapter 6
Data augmentation, Loss functions

Reading
1. Bishop, Chapter 5.5.3, 4.3
2. Goodfellow, Chapter 7.4
6.1 Data augmentation
In the previous chapter, we looked at convolutions as a way to reduce the number of parameters in a deep network, but more importantly as a way of building equivariance/invariance to translations. There are many nuisances other than translation that do not have a group structure, and this lack of structure precludes operations such as convolutions for building equivariance/invariance to them.

In this section, we will discuss techniques to build invariance to nuisances that are more complex than translations. These techniques will seem brute-force, but they also allow us to handle these more complex nuisances. The main trick is to augment the data, i.e., create variants of each input datum in some simple way such that we know its label is unchanged. If our original dataset is D = {(x^i, y^i)}_{i=1,...,n}, we create an augmented dataset
T(D) := {(T(x^i), y^i)}_{i=1,...,n} ∪ D,    (6.1)
where T is some operation of our choice. We have therefore expanded the number of samples in the training dataset to 2n instead of the original n. Effectively, data augmentation is a technique to create a dataset that is sampled from a data distribution P that is different from the original one.
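As a concrete illustration (a minimal sketch, not from the text), this is what (6.1) might look like in PyTorch, assuming the dataset is a list of (image tensor, label) pairs and taking T to be a horizontal flip:

import torch

def augment(dataset, T):
    # Build T(D) ∪ D as in (6.1): apply T to every input and keep the label.
    transformed = [(T(x), y) for (x, y) in dataset]
    return transformed + dataset  # 2n samples instead of n

# Hypothetical dataset of (image, label) pairs; images are (C, H, W) tensors.
D = [(torch.rand(3, 32, 32), 0), (torch.rand(3, 32, 32), 1)]
T = lambda x: torch.flip(x, dims=[2])  # horizontal flip along the width axis
TD = augment(D, T)
print(len(D), len(TD))  # n = 2 and 2n = 4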
6.1.1 Some basic data augmentation techniques
The most popular data augmentation techniques set T to be changes in brightness or contrast, cropping the image to simulate occlusions, flipping the image horizontally or vertically, jittering the pixels of the input image to simulate noise in the CCD of the camera or the effects of weather, padding the image, which changes its borders, warping the image using a projection that simulates the same picture taken from a different viewpoint, thresholding the RGB color channels, zooming into an image to simulate changes in scale, etc. You can see these operations at [Link].

Aside: FastAI is a wrapper on top of PyTorch and an excellent library to learn for doing your course projects.
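As an illustration, the following sketch composes a few of the augmentations above using torchvision.transforms; the parameter values are illustrative choices, not recommendations from the text.

import torch
from torchvision import transforms

# Hypothetical pipeline applied to PIL images; parameters are only illustrative.
augment = transforms.Compose([
    transforms.ColorJitter(brightness=0.2, contrast=0.2),   # brightness/contrast changes
    transforms.RandomHorizontalFlip(p=0.5),                 # horizontal flip
    transforms.RandomCrop(32, padding=4),                   # padding + cropping (occlusions, borders)
    transforms.RandomPerspective(distortion_scale=0.2),     # warp to simulate a different viewpoint
    transforms.ToTensor(),
    transforms.Lambda(lambda x: x + 0.01 * torch.randn_like(x)),  # pixel jitter (sensor noise)
])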
6.1.2 How does augmentation help?
A number of such augmentations are applied to the input data while training a deep network. This increases the number of samples n we have for training, but note that the different samples share a lot of information, so the effective number of novel samples has not increased by much. Let us get an idea of when augmentation is useful and when it is not. Consider a regression and a classification problem as shown below.
Figure 6.1: Cows live in many different parts of the world. A classifier that also uses background information to predict the category is likely to make mistakes when it is run in a different part of the world. Augmenting the input dataset on the left by replacing the background with a mountain or a city is therefore a good idea if we want to run the classifier in a different part of the world. This will also force the classifier to ignore the background pixels when it classifies the cow; in other words, the classifier is forced to become invariant to backgrounds by brute force, namely by showing it different backgrounds.
In essence, data augmentation forces the model to tackle a larger dataset than our original one. The model is forced to learn the nuisances the designer would like it to be invariant to. Compare this to the previous chapter: by replacing fully-connected layers with convolutions and pooling we made the model invariant to translations. In principle, we could instead have trained a fully-connected deep network on a very large augmented dataset with translated objects, and this would make the fully-connected network invariant to translations as well.
6.1.3 What kind of augmentation to use when?
In the example with regression, we saw that the regressor on the augmented data was essentially linear and had much less discriminative power than a polynomial regressor. This was of course by design: we chose how to augment the data. If the test data for the problem came from the polynomial instead of our augmented distribution, the new model would perform poorly.
Figure 6.2: The second panel shows the original scene with a mirror flip (i.e., a left-right flip, a reflection about the vertical axis), while the third panel shows the original scene after a water reflection (i.e., a top-bottom flip, a reflection about the horizontal axis). The latter is an image that is very unlikely to occur in the real world, so it is not a good idea to use it for training the model.
By being invariant to a larger set of nuisances than necessary, we waste the parameters of the model and risk a large error if the test data does not come from the augmented distribution. By being invariant to a smaller set of nuisances than necessary, we risk the situation that the test data contains new nuisances on which the classifier performs poorly. It is important to bear in mind that we do not always know what nuisances the model should be invariant to; the set of transformations used in data augmentation depends, often critically, upon the application.

Question: If you are building a classifier for detecting cars, motorbikes, people, etc. for an autonomous driving application, do you want it to be invariant to rotations?
Data augmentation requires a lot of domain expertise and often plays a huge role in the performance of a deep network. You should think about what kind of augmentations you would apply to data for speech processing, or to data from written text.
6.2 Loss functions
We next discuss the various loss functions that are typically used for training neural networks. As usual, we are given a dataset

D = {(x^i, y^i)}_{i=1,...,n}.
6.2.1 Regression

MSE loss. If the labels are real-valued, y^i ∈ R, e.g., we are predicting the price of housing in Boston given features of the houses (like you did in HW 0), we are solving a regression problem and the loss function to use for a deep network is also simply the regression loss

ℓmse(w) := (1/2) (f(x; w) − y)^2.    (6.2)
If you think about it carefully, it seems silly to add different dimensions of the input x using the weights w. Consider the case of x = [miles/gallon, number of other people with the same car, price of the car]. The three elements of x are in totally different units and on totally different scales. A popular trick to make things a bit more uniform for regression is to take a logarithmic transformation of the input, i.e., fit a model to log x using the loss

(1/2) (f(log x; w) − y)^2;

the logarithm is computed element-wise for vector-valued inputs.
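A minimal sketch of this log-transformation trick, with a placeholder linear model and made-up feature values:

import torch
import torch.nn as nn

f = nn.Linear(3, 1)                            # hypothetical model on three features
x = torch.tensor([[30.0, 4.0, 25000.0]])       # made-up features on very different scales
y = torch.tensor([[1.0]])

y_hat = f(torch.log(x))                        # fit the model to log x (element-wise logarithm)
loss = 0.5 * (y_hat - y).pow(2).mean()         # the MSE loss (6.2) applied to f(log x; w)
loss.backward()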
Huber loss. The square-residual loss in (6.2) works in most cases but it does not work well if there are outliers in the data. Outliers are data in the training set that are noisy or did not come from the true model. In such cases, we can use the Huber loss. If the residual is r = f(x; w) − y, the Huber loss is

ℓhuber(w; δ) = (1/2) r^2 if |r| ≤ δ, and δ (|r| − δ/2) otherwise.    (6.3)

Observe that this does not penalize the model egregiously if the prediction is bad (|r| > δ) for a particular datum. Doing so prevents the outliers from biasing the loss towards themselves and ruining the residuals for the other data.

Aside: We can perform regression in a clever way: first set all weights w_i = 0 and iteratively allow a subset of the weights (say the ones that improve the residuals the most) to become non-zero; the non-zero weights are fitted using ℓmse. This is known as forward selection. Backward selection starts with the weights w^* that minimize ℓmse and iteratively prunes them. Both forward and backward selection are techniques to fit a model w^* with sparse weights.
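A direct implementation of the Huber loss (6.3) might look as follows (recent versions of PyTorch also provide a built-in torch.nn.HuberLoss with the same piecewise form):

import torch

def huber_loss(y_hat, y, delta=1.0):
    # (6.3): quadratic for small residuals, linear for large ones.
    r = y_hat - y
    quadratic = 0.5 * r ** 2                       # used when |r| <= delta
    linear = delta * (r.abs() - 0.5 * delta)       # used when |r| > delta
    return torch.where(r.abs() <= delta, quadratic, linear).mean()

y_hat = torch.tensor([0.1, 0.2, 5.0])              # the last residual acts like an outlier
y = torch.tensor([0.0, 0.0, 0.0])
print(huber_loss(y_hat, y, delta=1.0))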
MAE loss. The absolute-error loss (or ℓ1 loss)

ℓmae(w) = |f(x; w) − y|    (6.4)
has a similar motivation: it does not penalize the residuals of the outliers as heavily as the squared loss does. Using a subset-selection technique or the ℓmae loss leads to sparse weights w^*. This makes the model more interpretable than a model fitted using the ℓmse loss. This is easy to understand for linear models: input dimensions corresponding to weights w_i^* that are zero do not take part in making predictions. So one may answer questions of the form "is variable x_i a relevant predictor of the target y?".
Variable importance. For linear models, another way to answer the same question is to fit two models: one with w_i fixed to zero and all other weights fitted using the MSE loss (6.2), and another model without fixing w_i; the difference between the average squared residuals in the two cases is a measure of how important the feature x_i is for the prediction. These techniques are called variable importance methods. We can also undertake the same program for nonlinear models on non-image-based data.
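A sketch of this refitting procedure on synthetic data, using ordinary least squares; dropping a column is equivalent to fixing the corresponding weight to zero:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                       # synthetic features
y = 2.0 * X[:, 0] + 0.1 * X[:, 2] + 0.1 * rng.normal(size=200)

def avg_sq_residual(X, y):
    # Least-squares fit followed by the average squared residual.
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.mean((X @ w - y) ** 2)

full = avg_sq_residual(X, y)
for i in range(X.shape[1]):
    reduced = avg_sq_residual(np.delete(X, i, axis=1), y)   # same as fixing w_i = 0
    print(f"importance of feature {i}: {reduced - full:.4f}")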
Quantile loss. The quantile loss is another simple trick to make the model more robust to outliers and to get more information out of the model than simply the prediction f(x; w). Observe that if the targets Y are random variables with cumulative distribution function F(y) = P(Y ≤ y), the τ-th quantile of Y is given by

QY(τ) = F^{−1}(τ) = inf {y : F(y) ≥ τ}

for τ ∈ (0, 1). We now learn a predictor f(x; w) for QY(τ). It turns out (you can try to prove this) that this corresponds to the loss function

ℓquantile(w; τ) = r(τ − 1) if r < 0, and rτ otherwise; equivalently, ℓquantile(w; τ) = r (τ − 1{r < 0}),    (6.5)

where r = y − f(x; w) is the residual. A standard technique is to fit multiple models using the quantile loss for different quantiles, say τ = 0.25, 0.5, 0.75, and give multiple predictions f(x; wτ) of the target.

Aside: The quantile loss is also called the pinball loss. Unlike the regression loss, it is highly asymmetric around the origin. If r > 0, we penalize the model by τ|r|, and if r < 0, i.e., if we predict something larger than the true y, we penalize the model by (1 − τ)|r|.
A typical example of quantile linear regression looks as follows.
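As a hedged sketch (synthetic data, placeholder linear model), the pinball loss (6.5) and the fitting of one model per quantile can be written as:

import torch
import torch.nn as nn

def quantile_loss(y_hat, y, tau):
    # (6.5): the pinball loss on the residual r = y - f(x; w).
    r = y - y_hat
    return torch.where(r < 0, r * (tau - 1.0), r * tau).mean()

x = torch.randn(256, 1)
y = 2.0 * x + 0.5 * torch.randn(256, 1)             # synthetic data with additive noise

models = {}
for tau in (0.25, 0.5, 0.75):
    f = nn.Linear(1, 1)
    opt = torch.optim.SGD(f.parameters(), lr=0.1)
    for _ in range(200):
        opt.zero_grad()
        quantile_loss(f(x), y, tau).backward()
        opt.step()
    models[tau] = f                                  # one predictor f(x; w_tau) per quantile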
6.2.2 Classification: Cross-Entropy loss
We next discuss the case when the targets are categorical and we wish to train a discriminative model that classifies the input into one of m categories

y ∈ {1, ..., m}.
One-hot encoding.
An alternative representation of the targets in classification is the so-called one-hot encoding, where y is transformed to

one-hot(y) = e_y ∈ R^m;

the vector e_y has a 1 at the y-th element and zeros everywhere else. The notation e_y denotes the y-th row of the identity matrix I_{m×m}.
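In PyTorch, torch.nn.functional.one_hot computes this encoding; note that it expects zero-indexed classes, i.e., y ∈ {0, ..., m − 1}:

import torch
import torch.nn.functional as F

y = torch.tensor([2, 0, 1])               # three labels; classes are zero-indexed in PyTorch
print(F.one_hot(y, num_classes=3))        # each row is a row of the 3x3 identity matrix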
Predicting class probabilities.
Instead of using the regression loss by treating y as a real-valued quantity, it is more natural to predict the log-probability log p(k | x) of every category k using the weights w, and to predict the category using

f(x; w) = argmax_k log pw(k | x).    (6.6)

Just like we denoted the raw predictions of the model by ŷ in linear/logistic regression, we will denote

R^m ∋ ŷ = v^⊤ σ(S_L^⊤ · · · σ(S_2^⊤ σ(S_1^⊤ x)) · · · ),    (6.7)

where v ∈ R^{p×m}. As we saw in Chapter 4, the entries of ŷ are also called logits. Observe that the logits ŷ are simply a vector in R^m. How can we transform these logits to get log pw(k | x) for all k ∈ {1, ..., m} as the output of the model?
Logistic loss.
Linear logistic regression has a scalar output ŷ ∈ R which is interpreted as the log-odds of the class probabilities

log [ p(1 | x) / p(0 | x) ] = ŷ = w^⊤ x.    (6.8)

This expression can be rewritten as p(1 | x) = sigmoid(ŷ). The likelihood of the data under this model, for labels y^i ∈ {0, 1}, is

pw( (x^1, y^1), ..., (x^n, y^n) ) = ∏_{i=1}^n pw(1 | x^i)^{y^i} pw(0 | x^i)^{1 − y^i}.

Maximizing this likelihood (MLE) is the same as minimizing the negative log-likelihood

ℓlogistic(w) := − log pw( (x^1, y^1), ..., (x^n, y^n) ) = − ∑_{i=1}^n [ y^i log pw(1 | x^i) + (1 − y^i) log pw(0 | x^i) ].    (6.9)

In other words, the logistic loss is simply maximum-likelihood estimation for the model (6.8).

Question: We saw a different expression for the logistic loss in Chapter 3, namely ℓlogistic(w) = log(1 + e^{−y ŷ}). What is the difference?
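A sketch that evaluates (6.9) directly for a linear model on synthetic data, and checks it against PyTorch's numerically stabler built-in:

import torch
import torch.nn.functional as F

w = torch.randn(5, requires_grad=True)
x = torch.randn(16, 5)                            # 16 synthetic samples with 5 features
y = torch.randint(0, 2, (16,)).float()            # binary labels in {0, 1}

y_hat = x @ w                                     # the log-odds (6.8)
p1 = torch.sigmoid(y_hat)                         # p_w(1 | x)
nll = -(y * torch.log(p1) + (1 - y) * torch.log(1 - p1)).sum()          # (6.9)
nll_builtin = F.binary_cross_entropy_with_logits(y_hat, y, reduction="sum")
print(nll.item(), nll_builtin.item())             # the two values agree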
Binary Cross-Entropy loss.
Let us turn back to neural networks and multi-class classification. Imagine that each logit of the neural network in (6.7) acts independently, i.e., it predicts whether class k is present in the input or not without paying heed to what the other logits predict. This is not very prudent: for instance, if we know beforehand that there is only one object in the input image, then such a classifier is likely to have lots of false positives. Nevertheless, observe that this is exactly like running m independent binary logistic classifiers on the same feature h_L ∈ R^p. We can write the loss for such a classifier succinctly as

ℓbce(w) = − ∑_{k=1}^m one-hot(y)_k log pw(k | x).    (6.10)

If the ground-truth labels y^i are such that there is only one class in each input image, all entries of one-hot(y^i) at the other categories will be zero, so this loss penalizes only the output of one of the m independent logistic classifiers.
6.2.3 Softmax Layer

Observe that our classifier, which employs m binary logistic classifiers to predict all the categories independently, does not predict a valid probability distribution because

∑_{k=1}^m pw(k | x)

is not always equal to 1. We can however posit that the model predicts logits ŷ that are proportional to the log-probabilities,

log pw(k | x) ∝ ŷ_k  ⇒  pw(k | x) = e^{ŷ_k/T} / ∑_{k'=1}^m e^{ŷ_{k'}/T}.    (6.11)

The result pw(k | x) is a valid distribution on k because it sums up to 1. This operation, namely taking the logits ŷ and constructing probabilities out of them, is called the softmax operator. The constant T in (6.11) is called the temperature. A large value of T results in a smoother probability distribution pw(k | x) because the individual values of the logits matter less. A small value of T puts a very large weight, due to the exponent, on the largest logit, and the distribution pw(k | x) is therefore highly spiked. The temperature is set to 1 by default in PyTorch.

Aside: You will often see people calling the quantity log ∑_{k'=1}^m e^{ŷ_{k'}/T} the "softmax" of the vector ŷ. This is actually a more appropriate usage of the word because T log ∑_{k=1}^m e^{ŷ_k/T} ≈ max_k ŷ_k if one of the entries of ŷ is much larger than the others, or if T → 0. We will however use the word "softmax" to refer to the operation of transforming ŷ into pw(k | x) because we do not have any need for this softened version of the max operator.

The cross-entropy loss is now simply the maximum-likelihood loss
after the softmax operation,

ℓce(w) = − ∑_{k=1}^m one-hot(y)_k log pw(k | x) = − ŷ_y/T + log ∑_{k'=1}^m e^{ŷ_{k'}/T}.    (6.12)
Observe that the logit ŷ_y corresponding to the true class is being pushed higher; at the same time, if the logits of the incorrect classes are large, they are being pulled down through the summation. This is an important point to keep in mind: the cross-entropy loss after the softmax affects all logits, not just the logit of the correct class.
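A sketch that computes the temperature softmax (6.11) and the cross-entropy (6.12) in two equivalent ways for a single example; PyTorch's F.cross_entropy corresponds to T = 1:

import torch
import torch.nn.functional as F

y_hat = torch.tensor([2.0, -1.0, 0.5])            # logits for m = 3 classes
y = 0                                             # index of the true class
T = 1.0                                           # temperature

p = torch.softmax(y_hat / T, dim=0)               # (6.11); p sums to 1
ce_from_p = -torch.log(p[y])                      # -log p_w(y | x)
ce_from_logits = -y_hat[y] / T + torch.logsumexp(y_hat / T, dim=0)   # (6.12)
ce_builtin = F.cross_entropy(y_hat.unsqueeze(0), torch.tensor([y]))  # the T = 1 case

print(ce_from_p.item(), ce_from_logits.item(), ce_builtin.item())    # all equal at T = 1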
6.2.4 Label smoothing

The correct logit in (6.12) is encouraged to go to +∞ while the incorrect logits are encouraged to go to −∞. This can lead to dramatic over-fitting when the number of classes m is very large. Label smoothing is a trick that alleviates this problem: instead of using a one-hot encoding of the true label y, it uses the encoding

label-smoothing(y)_k = 1 − ϵ if k = y, and ϵ/(m − 1) otherwise.    (6.13)
The cross-entropy loss with this new encoding is now

ℓlabel-smoothing-ce(w) = − ∑_{k=1}^m label-smoothing(y)_k log pw(k | x) = −(1 − ϵ) log pw(y | x) − (ϵ/(m − 1)) ∑_{k ≠ y} log pw(k | x).    (6.14)
If you take the derivative of this loss with respect to ŷ, you will see that the value of ŷ that minimizes the loss is

ŷ*_k = log((m − 1)(1 − ϵ)/ϵ) + α if k = y, and α otherwise,    (6.15)

where α is an arbitrary real number. Notice that the logits of both the correct and the incorrect classes are finite in this case; they no longer blow up to infinity.
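A direct implementation of (6.13)-(6.14) might look as follows; note that the label_smoothing argument of torch.nn.CrossEntropyLoss uses a slightly different convention, in which ϵ is spread over all m classes including the true one:

import torch
import torch.nn.functional as F

def label_smoothing_ce(y_hat, y, eps):
    # Cross-entropy against the smoothed encoding (6.13)-(6.14).
    m = y_hat.shape[-1]
    target = torch.full_like(y_hat, eps / (m - 1))     # eps/(m-1) on the incorrect classes
    target[y] = 1.0 - eps                              # 1 - eps on the true class
    return -(target * F.log_softmax(y_hat, dim=-1)).sum()

y_hat = torch.tensor([2.0, -1.0, 0.5])
print(label_smoothing_ce(y_hat, y=0, eps=0.1))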
6.2.5 Multiple ground-truth classes

If there are multiple classes present in the input image, i.e., if the ground-truth data has multiple labels, we can easily use the vector

multi-hot(y) = ∑_k e_k,

where the sum runs over all the classes k present in the input, and set

ℓbce(w) = − ∑_{k=1}^m multi-hot(y)_k log pw(k | x)    (6.16)

in the BCE loss. We could also use this trick in the cross-entropy loss after the softmax operator, but it will not work well because the softmax operator is designed to amplify only the largest logit in ŷ; if we tried, the network would still be incentivized to predict only one class instead of all the classes.
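A sketch of the multi-hot encoding and of the loss (6.16); the full binary cross-entropy, as computed by binary_cross_entropy_with_logits, additionally penalizes confident predictions for the absent classes via a (1 − multi-hot(y)_k) log(1 − pw(k | x)) term:

import torch
import torch.nn.functional as F

m = 5
y_hat = torch.randn(m)                   # logits from m independent classifiers
present = torch.tensor([1, 3])           # hypothetical: classes 1 and 3 are in the image
multi_hot = torch.zeros(m)
multi_hot[present] = 1.0                 # multi-hot(y) = e_1 + e_3

p = torch.sigmoid(y_hat)                 # each p_w(k | x) predicted independently
loss_as_written = -(multi_hot * torch.log(p)).sum()                       # (6.16)
loss_full_bce = F.binary_cross_entropy_with_logits(y_hat, multi_hot,
                                                   reduction="sum")       # also penalizes absent classes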