Module - 2 Ver 1.4

Regularization techniques are used in machine learning to reduce overfitting and improve generalization. Some common regularization strategies include parameter norm penalties, data augmentation, and early stopping. The goal is to find a balanced model that has low bias and low variance by reducing model flexibility for overfitted models and increasing flexibility for underfitted models. Regularization terms like L1 and L2 norms are added to the objective function to penalize high parameter values and improve generalization to new data.


Regularization for Deep Learning

Module - 2
Regularization: Definition 2

• Algorithm should perform well on
new inputs, not just on training
data.
• Strategies are used in ML to reduce
test error, possibly by allowing an increase in
training error.
• Such strategies are collectively
known as REGULARIZATION.
• Low bias and low variance is the goal.

[Figure: accuracy of each model with respect to training data and test data, shown for three models]
Regularization Strategies 3

1. Parameter Norm Penalties
2. Norm Penalties as Constrained Optimization
3. Regularization and Under-constrained Problems
4. Data Set Augmentation
5. Noise Robustness
6. Semi-supervised Learning
7. Multi-task Learning
8. Early Stopping
9. Parameter Tying and Parameter Sharing
10. Sparse Representations
11. Bagging and Other Ensemble Methods
12. Dropout
13. Adversarial Training
14. Tangent Methods
Bias and errors 4

1. Bias is the difference between the predicted value and the expected/true value.
2. The model makes certain assumptions about the data to make the target function simple,
but those assumptions may not always be correct.
3. A high bias model makes more assumptions about the target function.
4. High bias can cause an algorithm to miss the correct relationship between features and the
target output (underfitting).
5. The bias error is the error due to wrong/inaccurate assumptions that the learning
algorithm makes during training.
6. Zero bias may sound good because the model perfectly fits the training data, but this means
the model has learned too much from the training data; this is called overfitting, and the
model will not be able to do a good job with new/test data.
Variance and errors 5

• Variance occurs when the model considers the fluctuations/noise
in the data during training.
• Variance is the error due to sensitivity to small fluctuations in the
dataset.
• A high variance model learns too much from the data as it still
considers the noise as something to learn from, as a result, it
becomes very sensitive to any small fluctuation, and it overfits
the training data.
• High variance can cause an algorithm to model the random noise
in the training data, rather than the intended outcome.
Bias and Variance Trade-off 6
Generalization Error & Regularization 7

• Generalization error (also known as the out-of-sample error or
the risk) is a measure of how accurately an algorithm can predict
outcome values for previously unseen data.

• Regularization = any modification made to a learning algorithm
that is intended to reduce its generalization error but not its
training error.
Regularization strategies 8

• Option-1: put extra constraints on a machine learning model,
such as adding restrictions on the parameter values.
• Option-2: add extra terms in the objective function that can be
thought of as corresponding to a soft constraint on the parameter
values.
• Other forms of Regularizations: ensemble methods, dropout etc.
Strategies to create large, deep, regularized
model for deep learning. 9

1. Parameter Norm Penalties: penalizing the estimation by adding an
extra term to the error function.
2. L1 and L2 regularizations are some of the techniques used to
address the overfitting issues.
Context of the example. 10

• Trying to predict the number of matches won based on age.

[Figure: underfit and overfit models fitted to the example data]
Balanced fit 11
Improving the performance of model-1 12

Scenario-1: The model's poor performance on the training data
could be because the model is too simple.
• Increase the model flexibility.
• Add new domain-specific features.
• Decrease the amount of regularization used.
Improving the performance of model-2 13

Scenario-2: The model is overfitting the training data.
• Reduce the model flexibility.
• Consider using fewer feature combinations.
• Increase the amount of regularization used.
Reduce the overfitting issue 14
Error computation 15

• Using the mean squared error (MSE) function.
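A standard form of the MSE objective over m training examples, reconstructed here because the slide's formula did not survive extraction (ŷ denotes the model's prediction):

J(\theta) = \mathrm{MSE} = \frac{1}{m} \sum_{i=1}^{m} \left( \hat{y}^{(i)} - y^{(i)} \right)^{2}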


Substituting for the predicted value 16

The predicted value is a
higher-order polynomial, and its form varies
depending on the problem domain.

X1 and X2 represent the
age of a person in the
given example.
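As an illustrative sketch (this exact form is assumed, not shown on the slide), a second-order polynomial prediction substituted into the MSE above would give:

\hat{y} = \theta_0 + \theta_1 x_1 + \theta_2 x_2, \qquad
J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left( \theta_0 + \theta_1 x_1^{(i)} + \theta_2 x_2^{(i)} - y^{(i)} \right)^{2}

where x1 could be the age and x2 a higher-order feature of the age, following the slide's example.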
Objective of the regularization 17

• To minimize the error in each iteration.


• L2 and L1 regularizations are used.
L2 Regularization 18

• Add a new penalty term to the objective so that the model is penalized heavily.

• Penalize higher values of theta, which makes the error bigger every time theta
gets bigger.
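A common form of the L2-regularized objective, sketched here with λ as the regularization strength (the symbol is an assumption; the slide's own formula is not available):

J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left( \hat{y}^{(i)} - y^{(i)} \right)^{2} + \lambda \sum_{j} \theta_j^{2}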
L1 Regularization 19

• Consider the absolute value of theta in the added penalty term.
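The corresponding L1-regularized objective, again a sketch with λ as the regularization strength:

J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left( \hat{y}^{(i)} - y^{(i)} \right)^{2} + \lambda \sum_{j} \left| \theta_j \right|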


Norm Penalties and Norm? 20

• The norm is a quantity which describes the size of a vector.


• When a vector is stretched, the norm is multiplied by the stretching factor
• The norm of the sum of two vectors is less than or equal to the sum of the
norm of each individually
• The norm can never be negative
• The zero vector has norm 0
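These properties can be written compactly in standard notation (not taken from the slide):

\|\alpha x\| = |\alpha| \, \|x\|, \qquad \|x + y\| \le \|x\| + \|y\|, \qquad \|x\| \ge 0, \qquad \|\mathbf{0}\| = 0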
L2 Norm, L1 Norm etc… 21

• The L2 norm is the most common norm function in machine learning. Its
definition is the same as the Euclidean distance formula between the endpoint
of the vector and the origin:

• The commonly used L1 norm is simply the sum of the absolute values of the elements of the vector:
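The standard definitions, reconstructed here because the slide's formulas did not survive extraction:

\|x\|_2 = \sqrt{\sum_i x_i^{2}}, \qquad \|x\|_1 = \sum_i |x_i|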

• In machine learning, norms are used for:


• Defining a loss function in terms of the magnitude of the distance between predicted and
actual points
• Defining a regularization term which includes the magnitude of the weights, to
encourage small weights
Norm Penalties as Constrained Optimization 22
Generalized Lagrange Function 23

• The constrained optimization problem requires us to minimize the
function while ensuring the point discovered belongs to the
feasible set.

[Equation annotation: the original function; arbitrary constants multiplying the equality-constraint functions; arbitrary constants multiplying the inequality-constraint functions; the solution to the generalized Lagrange function]
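The slide's equation itself is not recoverable; the standard generalized Lagrangian that these labels appear to annotate (f is the original function, g^{(i)} the equality constraints, h^{(j)} the inequality constraints, λ and α the arbitrary multiplier constants) is:

L(x, \lambda, \alpha) = f(x) + \sum_i \lambda_i \, g^{(i)}(x) + \sum_j \alpha_j \, h^{(j)}(x), \qquad
x^{*} = \arg\min_{x} \max_{\lambda} \max_{\alpha \ge 0} L(x, \lambda, \alpha)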
Insight into the effect of constraint 24

• Constraining the norm of each layer separately prevents any one
hidden unit from having very large weights.
• When using high learning rates, it is possible to encounter a
positive feedback loop in which large weights induce large
gradients which then induce a large update to the weights.
Underconstrained Problems 25

• Many linear models in ML depend on inverting the matrix X^T X
(built from the data) to solve regression problems.

• When X^T X is singular, it cannot be inverted, so the closed-form
solution to the regression problem in linear algebra breaks down.
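For reference, a standard remedy (not spelled out on the slide): regularized linear regression inverts X^T X + αI instead, which is invertible for any α > 0:

w = \left( X^{\top} X + \alpha I \right)^{-1} X^{\top} y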
Solution for underconstrained problems 26

• Weight decay
• Weight decay is a regularization technique of adding a small penalty,
usually the L2 norm of the weights (all the weights of the model), to the
loss function.
loss = loss + weight decay parameter * L2 norm of the weights
Loss = MSE(y_hat, y) + wd * sum(w^2)
• Regularization helps stop the iteration when the slope of the likelihood
equals the weight decay coefficient.
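A minimal NumPy sketch of the weight-decay expression above (the names wd, w, and y_hat are illustrative assumptions, not from the slides):

import numpy as np

def weight_decay_loss(y_hat, y, w, wd=0.01):
    """Mean squared error plus an L2 (weight decay) penalty on the weights."""
    mse = np.mean((y_hat - y) ** 2)      # data-fitting term
    l2_penalty = wd * np.sum(w ** 2)     # penalize large weight values
    return mse + l2_penalty

# Toy usage with made-up numbers
w = np.array([0.5, -1.2, 3.0])
y_hat = np.array([1.0, 2.0, 3.0])
y = np.array([1.1, 1.9, 2.7])
print(weight_decay_loss(y_hat, y, w))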
Regularization in linear algebra problems 27

• We can solve underdetermined linear equations using the
Moore-Penrose pseudoinverse.
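One way to see the link to regularization (a standard identity, stated here as a sketch): the pseudoinverse is the limit of the weight-decay-regularized solution as the regularization coefficient vanishes:

X^{+} = \lim_{\alpha \to 0} \left( X^{\top} X + \alpha I \right)^{-1} X^{\top}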
Dataset Augmentation 28

• The best way to make a machine learning model generalize better
is to train it on more data.
• In practice, sample data is limited.
• One way to get around this is – CREATE FAKE DATA (data synthesis)
and ADD IT TO DATA SET = Data Augmentation
Augmentation scenarios 29

• Data augmentation for classification => generate new samples (x, y)
by transforming the inputs.
• Not readily applicable to density estimation unless the density-estimation
problem has already been solved.
• Effective for object recognition, speech recognition.
• Injecting noise in the input for neural network is also a form of
augmentation.
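A toy NumPy sketch of the two augmentation styles mentioned above, transforming inputs and injecting input noise (array shapes and the noise level are assumptions for illustration):

import numpy as np

rng = np.random.default_rng(0)

def flip_horizontal(image):
    """Mirror an image left-right; the class label stays unchanged."""
    return image[:, ::-1]

def add_input_noise(x, std=0.05):
    """Inject small Gaussian noise into the inputs (noise-as-augmentation)."""
    return x + rng.normal(0.0, std, size=x.shape)

image = rng.random((4, 4))   # stand-in for a 4x4 grayscale image
augmented = [flip_horizontal(image), add_input_noise(image)]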
Dataset augmentation sample 30
Noise Robustness 31

• Noise applied at inputs


• Noise applied to weights
• Injecting noise at the output targets
Noise injection is powerful 32

• Noise applied to the inputs is a form of data augmentation.
• For some models, the addition of noise with infinitesimal variance at the input is
equivalent to imposing a penalty on the norm of the weights.
• Noise applied to hidden units
– Noise injection can be much more powerful than simply shrinking the
parameters
– Noise applied to hidden units is so important that it merits its own separate
discussion
• Dropout is the main development of this approach
Adding Noise to Weights 33

• This technique is primarily used with RNNs (recurrent neural networks).
• This can be interpreted as a stochastic implementation of Bayesian
inference over the weights.
• Bayesian learning considers model weights to be uncertain and representable
via a probability distribution p(w) that reflects that uncertainty.
• This can be seen in a regression setting with a labelled dataset, as sketched below.
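In that regression setting, injecting weight noise with small variance η is approximately equivalent to adding a penalty that favours weights where small perturbations change the output little; a standard statement of this result (reconstructed, not from the slide) is:

\tilde{J} \approx J + \eta \, \mathbb{E}_{p(x,y)} \left[ \left\| \nabla_{W} \, \hat{y}(x) \right\|^{2} \right]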
Adding noise to output units 34

• Most datasets have some mistakes in the labels, so maximizing the predicted
probability of a mistaken label can be harmful.
• One way to prevent this is to explicitly model the label noise.
• Label smoothing is a mechanism to regularize a model based on a
softmax output.
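A sketch of label smoothing in its standard form (ε and the number of classes k are not given on the slide): for a k-class softmax, the hard 0/1 targets are replaced by

y_{\text{correct}} = 1 - \epsilon, \qquad y_{\text{incorrect}} = \frac{\epsilon}{k - 1}

so the model is never asked to produce probabilities of exactly 0 or 1.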
Questions? 35
NEXT CLASS:
Semi-Supervised Learning, Multi-Task Learning
