
REGULARIZATION

SPARSE REPRESENTATIONS
SPARSE REPRESENTATIONS
• Direct and Indirect Penalties
• Direct Penalty
• Weight decay penalizes parameters directly
• L1 penalization induces sparse parameterization
• Indirect Penalty
• Another strategy is to place a penalty on the
activations of the units in the neural network,
encouraging their activations to be sparse
• This indirectly imposes a complicated penalty on
the model parameters
• Representational sparsity describes a
representation where many of the elements of
the representation are zero (or close to zero)
SPARSE REPRESENTATIONS
• Direct versus Representational Sparsity
• Parameter regularization: a sparsely parameterized
linear model y = Ax, where most entries of the weight
matrix A are zero
• Representational regularization: a linear model
h = Bx whose representation h of the data x has
most entries equal to zero
SPARSE REPRESENTATIONS
• Representational Regularization
• Accomplished using the same sort of mechanisms used in parameter
regularization
• Norm penalty regularization of representation
• Performed by adding to the loss function J a norm
penalty Ω(h) on the representation, giving the
regularized loss J̃(θ; X, y) = J(θ; X, y) + α Ω(h),
with α ≥ 0 weighting the penalty (see the sketch below)
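A minimal sketch of such a representational penalty, assuming PyTorch; the encoder/decoder sizes, the reconstruction loss, and the penalty weight alpha are illustrative. An L1 norm on the hidden activations h is added to the usual loss J, in contrast to weight decay, which penalizes the weights themselves.

```python
# Minimal sketch: L1 penalty on the hidden representation (PyTorch assumed).
import torch
import torch.nn as nn

encoder = nn.Linear(100, 50)   # produces the representation h
decoder = nn.Linear(50, 100)   # reconstructs the input from h
alpha = 1e-3                   # strength of the sparsity penalty (illustrative)

optimizer = torch.optim.SGD(
    list(encoder.parameters()) + list(decoder.parameters()), lr=0.01)

x = torch.randn(32, 100)       # a dummy minibatch

h = torch.relu(encoder(x))     # hidden representation of the data
x_hat = decoder(h)

data_loss = ((x_hat - x) ** 2).mean()      # the usual loss J
sparsity_penalty = alpha * h.abs().sum()   # alpha * ||h||_1, pushes activations toward 0
loss = data_loss + sparsity_penalty        # J~ = J + alpha * Omega(h)

optimizer.zero_grad()
loss.backward()
optimizer.step()
```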
SPARSE REPRESENTATIONS
• Placing a Hard Constraint on Activation Values
• Another approach to representational sparsity:
• place a hard constraint on the activation values
• Called orthogonal matching pursuit (OMP)
• Encode x with the h that solves the constrained
optimization problem
arg min over h with ||h||0 < k of ||x − Wh||²
• where ||h||0 is the number of nonzero entries of h
• The problem can be solved efficiently when W is
constrained to be orthogonal
• Often called OMP-k, where k is the number of nonzero
entries allowed
• Essentially, any model with hidden units can be made
sparse (a greedy OMP-k sketch follows)
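A minimal NumPy sketch of greedy OMP-k; the function name and shapes are illustrative, and it assumes the columns of W are the dictionary atoms. At each of k steps it selects the column most correlated with the current residual, then refits the selected coefficients by least squares.

```python
import numpy as np

def omp_k(W, x, k):
    """Greedy OMP-k: find h with at most k nonzero entries minimizing ||x - W h||^2."""
    n_features = W.shape[1]
    residual = x.copy()
    support = []                              # indices of selected (nonzero) entries of h
    h = np.zeros(n_features)
    for _ in range(k):
        # pick the column of W most correlated with the current residual
        correlations = np.abs(W.T @ residual)
        correlations[support] = -np.inf       # do not reselect an already-chosen column
        support.append(int(np.argmax(correlations)))
        # refit the selected coefficients by least squares
        coeffs, *_ = np.linalg.lstsq(W[:, support], x, rcond=None)
        residual = x - W[:, support] @ coeffs
    h[support] = coeffs
    return h

# Example usage with random data
rng = np.random.default_rng(0)
W = rng.standard_normal((20, 50))
x = rng.standard_normal(20)
h = omp_k(W, x, k=3)
print(np.count_nonzero(h))  # at most 3
```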
REGULARIZATION
BAGGING AND OTHER ENSEMBLE METHODS
BAGGING AND OTHER ENSEMBLE METHODS
• What is bagging?
• It is short for Bootstrap Aggregating
• It is a technique for reducing generalization error
by combining several models
• Idea is to train several models separately, then have all
the models vote on the output for test examples
• This strategy is called model averaging
• Techniques employing this strategy are known as
ensemble methods
• Model averaging works because different models
will usually not all make the same errors on the test set
BAGGING AND OTHER ENSEMBLE METHODS
• Ex: Ensemble error rate
• Consider set of k regression models
• Each model i = 1, …, k makes an error εi on each example
• Errors drawn from a zero-mean multivariate normal
with variances E[εi²] = v and covariances E[εiεj] = c
• Error of the average prediction of all ensemble models:
(1/k) Σi εi
• Expected squared error of the ensemble prediction:
E[((1/k) Σi εi)²] = (1/k) v + ((k−1)/k) c
• If errors are perfectly correlated, c = v, and the mean
squared error reduces to v, so model averaging does
not help
• If errors are perfectly uncorrelated and c = 0, the expected
squared error of the ensemble is only v/k
(verified numerically in the sketch below)
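A small NumPy check of the formula above by Monte Carlo sampling of correlated errors; the values of k, v, and c are illustrative.

```python
import numpy as np

k, v, c = 10, 1.0, 0.3          # number of models, error variance, error covariance
n_trials = 200_000

# covariance matrix with variance v on the diagonal and covariance c elsewhere
cov = np.full((k, k), c)
np.fill_diagonal(cov, v)

rng = np.random.default_rng(0)
errors = rng.multivariate_normal(np.zeros(k), cov, size=n_trials)  # shape (n_trials, k)

ensemble_error = errors.mean(axis=1)          # error of the averaged prediction
empirical = np.mean(ensemble_error ** 2)      # estimate of the expected squared error
predicted = v / k + (k - 1) * c / k           # closed-form value from the slide

print(empirical, predicted)  # both close to 0.37; with c = v this would be v, with c = 0, v/k
```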
BAGGING AND OTHER ENSEMBLE METHODS
• Ensemble vs Bagging
• Different ensemble methods construct the ensemble of
models in different ways
• Ex: each member of ensemble could be formed by training a
completely different kind of model using a different algorithm or
objective function
• Bagging is a method that allows the same kind of model,
training algorithm, and objective function to be reused
several times
• The Bagging Technique
• Given a training set D of size N, generate k data sets
Di with the same number of examples as the original by
sampling with replacement
• Some observations are repeated in each Di while others
are omitted (on average about two-thirds of the original
examples appear in each Di); see the resampling sketch below
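A minimal NumPy sketch of the resampling and averaging steps; train_model and predict are hypothetical placeholders for whatever base learner is used, and X, y are assumed to be NumPy arrays.

```python
import numpy as np

def bagging_fit(X, y, k, train_model, rng):
    """Train k models, each on a bootstrap sample of (X, y) drawn with replacement."""
    n = len(X)
    models = []
    for _ in range(k):
        idx = rng.integers(0, n, size=n)            # sample N indices with replacement
        models.append(train_model(X[idx], y[idx]))  # some examples repeat, others are omitted
    return models

def bagging_predict(models, X, predict):
    """Model averaging: mean of the individual predictions (a vote, for classification)."""
    return np.mean([predict(m, X) for m in models], axis=0)
```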
BAGGING AND OTHER ENSEMBLE METHODS
• Example of Bagging Principle
• Task of training an 8 detector
• Bagging training procedure
• make different data sets by resampling the given data
set

• Each detector is brittle, but their average is robust,
achieving maximal confidence only when both loops
of the 8 are present
BAGGING AND OTHER ENSEMBLE METHODS
• Neural nets and bagging
• Neural nets reach a wide variety of solution points
• Thus they benefit from model averaging even when
all models are trained on the same dataset
• Differences in:
• random initialization
• random selection of minibatches
• hyperparameters
• are often enough to cause different members of the
ensemble to make partially independent errors
BAGGING AND OTHER ENSEMBLE METHODS
• Model averaging is powerful
• Model averaging is a reliable method for reducing
generalization error
• Machine learning contests are usually won by model averaging
over dozens of models
• Since the performance of model averaging comes at the
expense of increased computation and memory, benchmark
comparisons of algorithms are usually made using a single model
• Boosting
• Constructs an ensemble by incrementally adding models
(a sketch of one residual-fitting flavor follows)
• Has been applied to ensembles of neural networks, by
incrementally adding neural networks to the ensemble
• Also by interpreting an individual neural network as an
ensemble, incrementally adding hidden units to the network
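One common flavor of boosting for regression, sketched below with a hypothetical base learner supplied via fit_model and predict: each new model is fit to the residual of the current ensemble prediction, so models are added incrementally rather than trained independently as in bagging. This is only an illustrative sketch; boosting variants differ in how each new model is targeted and weighted.

```python
import numpy as np

def boosting_fit(X, y, k, fit_model, predict, lr=0.1):
    """Incrementally add k models, each fit to the residual of the current ensemble."""
    models = []
    current = np.zeros_like(y, dtype=float)     # current ensemble prediction
    for _ in range(k):
        residual = y - current                  # what the ensemble still gets wrong
        m = fit_model(X, residual)              # the new model targets the residual
        models.append(m)
        current = current + lr * predict(m, X)  # add the new model's contribution
    return models

def boosting_predict(models, X, predict, lr=0.1):
    # must use the same lr as during fitting
    return lr * sum(predict(m, X) for m in models)
```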
REGULARIZATION
DROPOUT
DROPOUT

• Regularization with unlimited computation


• Best way to regularize a fixed size model is:
• Average the predictions of all possible settings of the parameters
• Weighting each setting with the posterior probability given the
training data
• This would be the Bayesian approach
• Dropout does this using considerably less computation
• By approximating an equally weighted geometric mean of the
predictions of an exponential number of learned models that share
parameters
• Dropout approximates bagging
• Bagging is a method of averaging over several models to improve
generalization
• It is impractical to train and evaluate an ensemble of many large
neural networks, since doing so is expensive in terms of runtime
and memory; dropout provides an inexpensive approximation
DROPOUT
• Removing units creates sub-networks
• Dropout trains an ensemble of all subnetworks
• Subnetworks formed by removing non-output units from an
underlying base network
• We can effectively remove a unit by multiplying its output
value by zero
• This works directly for networks based on performing a series
of affine transformations and nonlinearities
• Some modification is needed for models such as radial basis
function networks, which are based on the difference between
the unit state and a reference value
• Dropout Neural Net
• A simple way to prevent neural net overfitting
DROPOUT
• Dropout Neural Net
DROPOUT
• Performance with/without Dropout
DROPOUT
• Dropout as bagging
• In bagging we define k different models, construct k
different data sets by sampling from the dataset with
replacement, and train model i on dataset i
• Dropout aims to approximate this process, but with
an exponentially large number of neural networks
DROPOUT
• Dropout as an ensemble method
DROPOUT
• Mask for dropout training
• To train with dropout, we use a minibatch-based learning
algorithm that takes small steps, such as SGD
• At each step, randomly sample a binary mask to apply to
all of the input and hidden units in the network
• The probability of including a unit is a hyperparameter
• typically 0.5 for hidden units and 0.8 for input units
• We then run forward propagation, back-propagation, and
the learning update as usual
DROPOUT
• Forward Propagation with dropout
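A minimal NumPy sketch of a training-time forward pass with dropout masks on the input and hidden units of a two-layer net; the layer sizes, weight names, and ReLU choice are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(x, W1, b1, W2, b2):
    """One training-time forward pass with dropout masks on input and hidden units."""
    p_input, p_hidden = 0.8, 0.5                      # include probabilities (hyperparameters)

    mu_x = rng.binomial(1, p_input, size=x.shape)     # binary mask for the input units
    x_dropped = x * mu_x                              # "remove" a unit by multiplying by zero

    h = np.maximum(0.0, x_dropped @ W1 + b1)          # hidden layer (ReLU)
    mu_h = rng.binomial(1, p_hidden, size=h.shape)    # binary mask for the hidden units
    h_dropped = h * mu_h

    y = h_dropped @ W2 + b2                           # output units are never dropped
    return y

# Illustrative shapes: 10 inputs, 20 hidden units, 1 output, minibatch of 32
W1, b1 = rng.standard_normal((10, 20)), np.zeros(20)
W2, b2 = rng.standard_normal((20, 1)), np.zeros(1)
x = rng.standard_normal((32, 10))
y = dropout_forward(x, W1, b1, W2, b2)
```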
DROPOUT

• Formal description of dropout
• Suppose that a mask vector μ specifies which units to
include
• The cost of the model is specified by J(θ, μ)
• Dropout training consists of minimizing Eμ[J(θ, μ)]
• The expected value contains an exponential number of
terms, but we can obtain an unbiased estimate of its
gradient by sampling values of μ
DROPOUT
• Bagging training vs Dropout training
• Dropout training not same as bagging training
• In bagging, the models are all independent
• In dropout, models share parameters
• Models inherit subsets of parameters from parent network
• Parameter sharing makes it possible to represent an exponential
number of models with a tractable amount of memory
• In bagging, each model is trained to convergence on its
respective training set
• In dropout, most models are not explicitly trained
• A tiny fraction of the possible sub-networks are each trained
for a single step
• Parameter sharing causes the remaining sub-networks to arrive
at good settings of the parameters
