Chapter 6
Data augmentation, Loss functions

Reading
1. Bishop, Chapter 5.5.3, 4.3
2. Goodfellow, Chapter 7.4
6.1 Data augmentation
In the previous chapter, we looked at convolutions as a way to reduce the number of parameters in a deep network, but more importantly as a way of building equivariance/invariance to translations. There are many nuisances other than translation that do not have a group structure, and this lack of structure precludes operations such as convolutions for building equivariance/invariance to them.

In this section, we will discuss techniques to build invariance to nuisances that are more complex than translations. These techniques will seem brute-force, but they also allow us to handle these more complex nuisances. The main trick is to augment the data, i.e., create variants of each input datum in some simple way such that we know its label is unchanged. If our original dataset is D = {(x^i, y^i)}_{i=1,...,n}, we create an augmented dataset
T(D) := {(T(x^i), y^i)}_{i=1,...,n} ∪ D,    (6.1)
where T is some operation of our choice. We have therefore expanded the number of samples in the training dataset to 2n instead of the original n. Effectively, data augmentation is a technique to create a dataset that is sampled from a data distribution P that is different from the original one.
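As a concrete illustration (a minimal sketch, not from the text), this is what (6.1) might look like in PyTorch, assuming the dataset is a list of (image tensor, label) pairs and taking T to be a horizontal flip:

import torch

def augment(dataset, T):
    # Build T(D) ∪ D as in (6.1): apply T to every input and keep the label.
    transformed = [(T(x), y) for (x, y) in dataset]
    return transformed + dataset  # 2n samples instead of n

# Hypothetical dataset of (image, label) pairs; images are (C, H, W) tensors.
D = [(torch.rand(3, 32, 32), 0), (torch.rand(3, 32, 32), 1)]
T = lambda x: torch.flip(x, dims=[2])  # horizontal flip along the width axis
TD = augment(D, T)
print(len(D), len(TD))  # n = 2 and 2n = 4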
6.1.1 Some basic data augmentation techniques
The most popular data augmentation techniques set T to be changes in brightness or contrast, cropping the image to simulate occlusions, flipping the image horizontally or vertically, jittering the pixels of the input image to simulate noise in the CCD of the camera or the effects of weather, padding the image, which changes its borders, warping the image using a projection that simulates the same picture taken from a different viewpoint, thresholding the RGB color channels, zooming into an image to simulate changes in scale, etc. You can see these operations at [Link].

Aside: FastAI is a wrapper on top of PyTorch and an excellent library to learn for doing your course projects.
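As an illustration, the following sketch composes a few of the augmentations above using torchvision.transforms; the parameter values are illustrative choices, not recommendations from the text.

import torch
from torchvision import transforms

# Hypothetical pipeline applied to PIL images; parameters are only illustrative.
augment = transforms.Compose([
    transforms.ColorJitter(brightness=0.2, contrast=0.2),   # brightness/contrast changes
    transforms.RandomHorizontalFlip(p=0.5),                 # horizontal flip
    transforms.RandomCrop(32, padding=4),                   # padding + cropping (occlusions, borders)
    transforms.RandomPerspective(distortion_scale=0.2),     # warp to simulate a different viewpoint
    transforms.ToTensor(),
    transforms.Lambda(lambda x: x + 0.01 * torch.randn_like(x)),  # pixel jitter (sensor noise)
])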
6.1.2 How does augmentation help?
A number of such augmentations are applied to the input data while training a deep network. This increases the number of samples n we have for training, but note that the different samples share a lot of information, so the effective number of novel samples has not increased by much. Let us get an idea of when augmentation is useful and when it is not. Consider a regression and a classification problem as shown below.
Figure 6.1: Cows live in many different parts of the world. A classifier that also uses background information to predict the category is likely to make mistakes when it is run in a different part of the world. Augmenting the input dataset on the left by replacing the background with a mountain or a city is therefore a good idea if we want to run the classifier in a different part of the world. This will also force the classifier to ignore the background pixels when it classifies the cow; in other words, the classifier is forced to become invariant to backgrounds by brute force, namely by showing it different backgrounds.
In essence, data augmentation forces the model to tackle a larger dataset than our original one. The model is forced to learn the nuisances the designer would like it to be invariant to. Compare this to the previous chapter: by replacing fully-connected layers with convolutions and pooling we made the model invariant to translations. In principle, we could instead have trained a fully-connected deep network on a very large augmented dataset with translated objects, and this would make the fully-connected network invariant to translations as well.
6.1.3 What kind of augmentation to use when?
In the example with regression, we saw that the regressor on the augmented data was essentially linear and had much less discriminative power than a polynomial regressor. This was of course by design: we chose how to augment the data. If the test data for the problem came from the polynomial instead of our augmented distribution, the new model would perform poorly.
Figure 6.2: The second panel shows the original scene with a mirror flip (i.e., a left-right flip, a reflection about the vertical axis), while the third panel shows the original scene after a water reflection (i.e., a top-bottom flip, a reflection about the horizontal axis). The latter is an image that is very unlikely to occur in the real world, so it is not a good idea to use it for training the model.
By being invariant to a larger set of nuisances than necessary, we waste the parameters of the model and risk a large error if the test data does not come from the augmented distribution. By being invariant to a smaller set of nuisances than necessary, we risk the situation that the test data contains new nuisances on which the classifier performs poorly. It is important to bear in mind that we do not always know what nuisances the model should be invariant to; the set of transformations used in data augmentation depends, often critically, upon the application.

Question: If you are building a classifier for detecting cars, motorbikes, people, etc. for an autonomous driving application, do you want it to be invariant to rotations?
Data augmentation requires a lot of domain expertise and often plays a huge role in the performance of a deep network. You should think about what kind of augmentations you would apply to data for speech processing, or to data from written text.
6.2 Loss functions
We next discuss the various loss functions that are typically used for training neural networks. As usual, we are given a dataset

D = {(x^i, y^i)}_{i=1,...,n}.
6.2.1 Regression

MSE loss. If the labels are real-valued, y^i ∈ R, e.g., we are predicting the price of housing in Boston given features of the houses (like you did in HW 0), we are solving a regression problem and the loss function to use for a deep network is also simply the regression loss

ℓmse(w) := (1/2) (f(x; w) − y)^2.    (6.2)
If you think about it carefully, it seems silly to add different dimensions of the input x using the weights w. Consider the case of x = [miles/gallon, number of other people with the same car, price of the car]. The three elements of x are in totally different units and on totally different scales. A popular trick to make things a bit more uniform for regression is to take a logarithmic transformation of the input, i.e., fit a model to log x using the loss

(1/2) (f(log x; w) − y)^2;

the logarithm is computed element-wise for vector-valued inputs.
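A minimal sketch of this log-transformation trick, with a placeholder linear model and made-up feature values:

import torch
import torch.nn as nn

f = nn.Linear(3, 1)                            # hypothetical model on three features
x = torch.tensor([[30.0, 4.0, 25000.0]])       # made-up features on very different scales
y = torch.tensor([[1.0]])

y_hat = f(torch.log(x))                        # fit the model to log x (element-wise logarithm)
loss = 0.5 * (y_hat - y).pow(2).mean()         # the MSE loss (6.2) applied to f(log x; w)
loss.backward()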
Huber loss. The square-residual loss in (6.2) works in most cases but it does not work well if there are outliers in the data. Outliers are data in the training set that are noisy or did not come from the true model. In such cases, we can use the Huber loss. If the residual is r = f(x; w) − y, the Huber loss is

ℓhuber(w; δ) = (1/2) r^2 if |r| ≤ δ, and δ (|r| − δ/2) otherwise.    (6.3)

Observe that this does not penalize the model egregiously if the prediction is bad (|r| > δ) for a particular datum. Doing so prevents the outliers from biasing the loss towards themselves and ruining the residuals for the other data.

Aside: We can perform regression in a clever way: first set all weights w_i = 0 and iteratively allow a subset of the weights (say the ones that improve the residuals the most) to become non-zero; the non-zero weights are fitted using ℓmse. This is known as forward selection. Backward selection starts with the weights w^* that minimize ℓmse and iteratively prunes them. Both forward and backward selection are techniques to fit a model w^* with sparse weights.
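A direct implementation of the Huber loss (6.3) might look as follows (recent versions of PyTorch also provide a built-in torch.nn.HuberLoss with the same piecewise form):

import torch

def huber_loss(y_hat, y, delta=1.0):
    # (6.3): quadratic for small residuals, linear for large ones.
    r = y_hat - y
    quadratic = 0.5 * r ** 2                       # used when |r| <= delta
    linear = delta * (r.abs() - 0.5 * delta)       # used when |r| > delta
    return torch.where(r.abs() <= delta, quadratic, linear).mean()

y_hat = torch.tensor([0.1, 0.2, 5.0])              # the last residual acts like an outlier
y = torch.tensor([0.0, 0.0, 0.0])
print(huber_loss(y_hat, y, delta=1.0))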
MAE loss. The absolute-error loss (or ℓ1 loss)

ℓmae(w) = |f(x; w) − y|    (6.4)
has a similar motivation: it does not penalize the residuals of the outliers as heavily as the squared loss does. Using a subset-selection technique or the ℓmae loss leads to sparse weights w^*. This makes the model more interpretable than a model fitted using the ℓmse loss. This is easy to understand for linear models: input dimensions corresponding to weights w_i^* that are zero do not take part in making predictions. So one may answer questions of the form "is variable x_i a relevant predictor of the target y?".
Variable importance. For linear models, another way to answer the same question is to fit two models: one with w_i fixed to zero and all other weights fitted using the MSE loss (6.2), and another model without fixing w_i; the difference between the average squared residuals in the two cases is a measure of how important the feature x_i is for the prediction. These techniques are called variable importance methods. We can also undertake the same program for nonlinear models on non-image-based data.
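A sketch of this refitting procedure on synthetic data, using ordinary least squares; dropping a column is equivalent to fixing the corresponding weight to zero:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                       # synthetic features
y = 2.0 * X[:, 0] + 0.1 * X[:, 2] + 0.1 * rng.normal(size=200)

def avg_sq_residual(X, y):
    # Least-squares fit followed by the average squared residual.
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.mean((X @ w - y) ** 2)

full = avg_sq_residual(X, y)
for i in range(X.shape[1]):
    reduced = avg_sq_residual(np.delete(X, i, axis=1), y)   # same as fixing w_i = 0
    print(f"importance of feature {i}: {reduced - full:.4f}")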
Quantile loss. The quantile loss is another simple trick to make the model more robust to outliers and to get more information out of the model than simply the prediction f(x; w). Observe that if the targets Y are random variables with cumulative distribution function F(y) = P(Y ≤ y), the τ-th quantile of Y is given by

QY(τ) = F^{−1}(τ) = inf {y : F(y) ≥ τ}

for τ ∈ (0, 1). We now learn a predictor f(x; w) for QY(τ). It turns out (you can try to prove this) that this corresponds to the loss function

ℓquantile(w; τ) = r(τ − 1) if r < 0, and rτ otherwise; equivalently, ℓquantile(w; τ) = r (τ − 1{r < 0}),    (6.5)

where r = y − f(x; w) is the residual. A standard technique is to fit multiple models using the quantile loss for different quantiles, say τ = 0.25, 0.5, 0.75, and give multiple predictions f(x; wτ) of the target.

Aside: The quantile loss is also called the pinball loss. Unlike the regression loss, it is highly asymmetric around the origin. If r > 0, we penalize the model by τ|r|, and if r < 0, i.e., if we predict something larger than the true y, we penalize the model by (1 − τ)|r|.
A typical example of quantile linear regression looks as follows.
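As a hedged sketch (synthetic data, placeholder linear model), the pinball loss (6.5) and the fitting of one model per quantile can be written as:

import torch
import torch.nn as nn

def quantile_loss(y_hat, y, tau):
    # (6.5): the pinball loss on the residual r = y - f(x; w).
    r = y - y_hat
    return torch.where(r < 0, r * (tau - 1.0), r * tau).mean()

x = torch.randn(256, 1)
y = 2.0 * x + 0.5 * torch.randn(256, 1)             # synthetic data with additive noise

models = {}
for tau in (0.25, 0.5, 0.75):
    f = nn.Linear(1, 1)
    opt = torch.optim.SGD(f.parameters(), lr=0.1)
    for _ in range(200):
        opt.zero_grad()
        quantile_loss(f(x), y, tau).backward()
        opt.step()
    models[tau] = f                                  # one predictor f(x; w_tau) per quantile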
6.2.2 Classification: Cross-Entropy loss
We next discuss the case when the targets are categorical and we wish to train a discriminative model that classifies the input into one of m categories

y ∈ {1, ..., m}.
One-hot encoding.
An alternative representation of the targets in classification is the so-called one-hot encoding, where y is transformed to

one-hot(y) = e_y ∈ R^m;

the vector e_y has a 1 at the y-th element and zeros everywhere else. The notation e_y denotes the y-th row of the identity matrix I_{m×m}.
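In PyTorch, torch.nn.functional.one_hot computes this encoding; note that it expects zero-indexed classes, i.e., y ∈ {0, ..., m − 1}:

import torch
import torch.nn.functional as F

y = torch.tensor([2, 0, 1])               # three labels; classes are zero-indexed in PyTorch
print(F.one_hot(y, num_classes=3))        # each row is a row of the 3x3 identity matrix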
Predicting class probabilities.
Instead of using the regression loss by treating y as a real-valued quantity, it is more natural to predict the log-probability log p(k | x) of every category k using the weights w, and to predict the category using

f(x; w) = argmax_k log pw(k | x).    (6.6)

Just like we denoted the raw predictions of the model by ŷ in linear/logistic regression, we will denote

R^m ∋ ŷ = v^⊤ σ(S_L^⊤ · · · σ(S_2^⊤ σ(S_1^⊤ x)) · · · ),    (6.7)

where v ∈ R^{p×m}. As we saw in Chapter 4, the entries of ŷ are also called logits. Observe that the logits ŷ are simply a vector in R^m. How can we transform these logits to get log pw(k | x) for all k ∈ {1, ..., m} as the output of the model?
Logistic loss.
Linear logistic regression has a scalar output ŷ ∈ R which is interpreted as the log-odds of the class probabilities

log [ p(1 | x) / p(0 | x) ] = ŷ = w^⊤ x.    (6.8)

This expression can be rewritten as p(1 | x) = sigmoid(ŷ). The likelihood of the data under this model, for labels y^i ∈ {0, 1}, is

pw( (x^1, y^1), ..., (x^n, y^n) ) = ∏_{i=1}^n pw(1 | x^i)^{y^i} pw(0 | x^i)^{1 − y^i}.

Maximizing this likelihood (MLE) is the same as minimizing the negative log-likelihood

ℓlogistic(w) := − log pw( (x^1, y^1), ..., (x^n, y^n) ) = − ∑_{i=1}^n [ y^i log pw(1 | x^i) + (1 − y^i) log pw(0 | x^i) ].    (6.9)

In other words, the logistic loss is simply maximum-likelihood estimation for the model (6.8).

Question: We saw a different expression for the logistic loss in Chapter 3, namely ℓlogistic(w) = log(1 + e^{−y ŷ}). What is the difference?
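A sketch that evaluates (6.9) directly for a linear model on synthetic data, and checks it against PyTorch's numerically stabler built-in:

import torch
import torch.nn.functional as F

w = torch.randn(5, requires_grad=True)
x = torch.randn(16, 5)                            # 16 synthetic samples with 5 features
y = torch.randint(0, 2, (16,)).float()            # binary labels in {0, 1}

y_hat = x @ w                                     # the log-odds (6.8)
p1 = torch.sigmoid(y_hat)                         # p_w(1 | x)
nll = -(y * torch.log(p1) + (1 - y) * torch.log(1 - p1)).sum()          # (6.9)
nll_builtin = F.binary_cross_entropy_with_logits(y_hat, y, reduction="sum")
print(nll.item(), nll_builtin.item())             # the two values agree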
Binary Cross-Entropy loss.
Let us turn back to neural networks and multi-class classification. Imagine that each logit of the neural network in (6.7) acts independently, i.e., it predicts whether class k is present in the input or not without paying heed to what the other logits predict. This is not very prudent: for instance, if we know beforehand that there is only one object in the input image, then such a classifier is likely to have lots of false positives. Nevertheless, observe that this is exactly like running m independent binary logistic classifiers on the same feature h_L ∈ R^p. We can write the loss for such a classifier succinctly as

ℓbce(w) = − ∑_{k=1}^m one-hot(y)_k log pw(k | x).    (6.10)

If the ground-truth labels y^i are such that there is only one class in each input image, all entries of one-hot(y^i) at the other categories will be zero, so this loss penalizes only the output of one of the m independent logistic classifiers.
6.2.3 Softmax Layer

Observe that our classifier, which employs m binary logistic classifiers to predict all the categories independently, does not predict a valid probability distribution because

∑_{k=1}^m pw(k | x)

is not always equal to 1. We can however posit that the model predicts logits ŷ that are proportional to the log-probabilities,

log pw(k | x) ∝ ŷ_k  ⇒  pw(k | x) = e^{ŷ_k/T} / ∑_{k'=1}^m e^{ŷ_{k'}/T}.    (6.11)

The result pw(k | x) is a valid distribution on k because it sums up to 1. This operation, namely taking the logits ŷ and constructing probabilities out of them, is called the softmax operator. The constant T in (6.11) is called the temperature. A large value of T results in a smoother probability distribution pw(k | x) because the individual values of the logits matter less. A small value of T puts a very large weight, due to the exponent, on the largest logit, and the distribution pw(k | x) is therefore highly spiked. The temperature is set to 1 by default in PyTorch.

Aside: You will often see people calling the quantity log ∑_{k'=1}^m e^{ŷ_{k'}/T} the "softmax" of the vector ŷ. This is actually a more appropriate usage of the word because T log ∑_{k=1}^m e^{ŷ_k/T} ≈ max_k ŷ_k if one of the entries of ŷ is much larger than the others, or if T → 0. We will however use the word "softmax" to refer to the operation of transforming ŷ into pw(k | x) because we do not have any need for this softened version of the max operator.

The cross-entropy loss is now simply the maximum-likelihood loss
after the softmax operation,

ℓce(w) = − ∑_{k=1}^m one-hot(y)_k log pw(k | x) = − ŷ_y/T + log ∑_{k'=1}^m e^{ŷ_{k'}/T}.    (6.12)
Observe that the logit ŷ_y corresponding to the true class is being pushed higher; at the same time, if the logits of the incorrect classes are large, they are being pulled down through the summation. This is an important point to keep in mind: the cross-entropy loss after the softmax affects all logits, not just the logit of the correct class.
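A sketch that computes the temperature softmax (6.11) and the cross-entropy (6.12) in two equivalent ways for a single example; PyTorch's F.cross_entropy corresponds to T = 1:

import torch
import torch.nn.functional as F

y_hat = torch.tensor([2.0, -1.0, 0.5])            # logits for m = 3 classes
y = 0                                             # index of the true class
T = 1.0                                           # temperature

p = torch.softmax(y_hat / T, dim=0)               # (6.11); p sums to 1
ce_from_p = -torch.log(p[y])                      # -log p_w(y | x)
ce_from_logits = -y_hat[y] / T + torch.logsumexp(y_hat / T, dim=0)   # (6.12)
ce_builtin = F.cross_entropy(y_hat.unsqueeze(0), torch.tensor([y]))  # the T = 1 case

print(ce_from_p.item(), ce_from_logits.item(), ce_builtin.item())    # all equal at T = 1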
6.2.4 Label smoothing

The correct logit in (6.12) is encouraged to go to +∞ while the incorrect logits are encouraged to go to −∞. This can lead to dramatic over-fitting when the number of classes m is very large. Label smoothing is a trick that alleviates this problem: instead of using a one-hot encoding of the true label y, it uses the encoding

label-smoothing(y)_k = 1 − ϵ if k = y, and ϵ/(m − 1) otherwise.    (6.13)
The cross-entropy loss with this new encoding is now

ℓlabel-smoothing-ce(w) = − ∑_{k=1}^m label-smoothing(y)_k log pw(k | x) = −(1 − ϵ) log pw(y | x) − (ϵ/(m − 1)) ∑_{k ≠ y} log pw(k | x).    (6.14)
If you take the derivative of this loss with respect to ŷ, you will see that the value of ŷ that minimizes the loss is

ŷ*_k = log((m − 1)(1 − ϵ)/ϵ) + α if k = y, and α otherwise,    (6.15)

where α is an arbitrary real number. Notice that the logits of both the correct and the incorrect classes are finite in this case; they no longer blow up to infinity.
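A direct implementation of (6.13)-(6.14) might look as follows; note that the label_smoothing argument of torch.nn.CrossEntropyLoss uses a slightly different convention, in which ϵ is spread over all m classes including the true one:

import torch
import torch.nn.functional as F

def label_smoothing_ce(y_hat, y, eps):
    # Cross-entropy against the smoothed encoding (6.13)-(6.14).
    m = y_hat.shape[-1]
    target = torch.full_like(y_hat, eps / (m - 1))     # eps/(m-1) on the incorrect classes
    target[y] = 1.0 - eps                              # 1 - eps on the true class
    return -(target * F.log_softmax(y_hat, dim=-1)).sum()

y_hat = torch.tensor([2.0, -1.0, 0.5])
print(label_smoothing_ce(y_hat, y=0, eps=0.1))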
6.2.5 Multiple ground-truth classes

If there are multiple classes present in the input image, i.e., if the ground-truth data has multiple labels, we can easily use the vector

multi-hot(y) = ∑_k e_k,

where the sum runs over all the classes k present in the input, and set

ℓbce(w) = − ∑_{k=1}^m multi-hot(y)_k log pw(k | x)    (6.16)

in the BCE loss. We could also use this trick in the cross-entropy loss after the softmax operator, but it will not work well because the softmax operator is designed to amplify only the largest logit in ŷ; if we tried, the network would still be incentivized to predict only one class instead of all the classes.
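A sketch of the multi-hot encoding and of the loss (6.16); the full binary cross-entropy, as computed by binary_cross_entropy_with_logits, additionally penalizes confident predictions for the absent classes via a (1 − multi-hot(y)_k) log(1 − pw(k | x)) term:

import torch
import torch.nn.functional as F

m = 5
y_hat = torch.randn(m)                   # logits from m independent classifiers
present = torch.tensor([1, 3])           # hypothetical: classes 1 and 3 are in the image
multi_hot = torch.zeros(m)
multi_hot[present] = 1.0                 # multi-hot(y) = e_1 + e_3

p = torch.sigmoid(y_hat)                 # each p_w(k | x) predicted independently
loss_as_written = -(multi_hot * torch.log(p)).sum()                       # (6.16)
loss_full_bce = F.binary_cross_entropy_with_logits(y_hat, multi_hot,
                                                   reduction="sum")       # also penalizes absent classes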