
Lecture 9: Dropout, optimization and convolutional NNs

Announcements:

• HW #3 is due tonight. To submit your Jupyter Notebook, print the notebook to a pdf
with your solutions and plots filled in. You must also submit your .py files as pdfs.

• HW #4 will be uploaded today (Fri, Feb 17).

• Midterm exam review session: Thursday, Feb 16, 6-9pm at WG Young CS50.

• Tonmoy DIS 1E (3-4p) moved to Geology 6704.

• All past exams are uploaded to Bruin Learn (under “Modules” —> “past exams”).
This year, we will allow 4 cheat sheets (8.5 x 11” paper) that can be filled front and
back (8 sides total). The exam is otherwise closed book and closed notes.

• A word of thanks.

Note: the midterm covers material up to and including this lecture (Wednesday, Feb 15).

Prof J.C. Kao, UCLA ECE
Dropout

Let p be the probability of keeping a neuron (e.g., p = 0.5 for a layer of 100 neurons).
For each forward pass, draw a mask m of independent Bernoulli random variables, where
each entry is 1 with probability p and 0 with probability 1 - p, and multiply the
layer's activations elementwise by m.
Prof J.C. Kao, UCLA ECE
Dropout

The mask m has one binary entry per unit. With N units, there are 2^N possible
mask configurations, i.e., 2^N possible sub-networks.
Prof J.C. Kao, UCLA ECE
Dropout

Dropout in code.
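As a rough numpy sketch of dropout during a training forward pass (the keep probability
p = 0.5 and layer size of 100 follow the example above; the relu helper and the names
x, W, h are illustrative assumptions, not the course's implementation):

import numpy as np

def relu(x):
    return np.maximum(0, x)

p = 0.5                                  # probability of keeping a unit
x = np.random.randn(1, 100)              # one input example
W = np.random.randn(100, 100) * 0.01     # layer weights

h = relu(x.dot(W))                       # standard forward pass
mask = (np.random.rand(*h.shape) < p)    # Bernoulli(p) mask: 1 = keep, 0 = drop
h_dropped = h * mask                     # zero out the dropped units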

Prof J.C. Kao, UCLA ECE


Dropout

How about during test time? What configuration do you use?


At training time (keep probability p = 0.5), a unit's output is computed with a fresh
mask each iteration:

    h_out = relu(m1 w1 h1 + m2 w2 h2 + m3 w3 h3 + m4 w4 h4),   where mi ~ Bernoulli(p).

At test time we keep all the units but scale the pre-activation by p:

    h_out = relu(p (w1 h1 + w2 h2 + w3 h3 + w4 h4)).

Over many training iterations, the contribution of wi hi to h_out was, in expectation,
p * wi hi, so scaling by p at test time matches the average training-time contribution.
Prof J.C. Kao, UCLA ECE
Dropout

How about during test time? What configuration do you use?

We call this approach the weight scaling inference rule. There is not yet any
theoretical argument for the accuracy of this approximate inference rule in
deep nonlinear networks, but empirically it performs very well.

In this class, instead of scaling the weights, we’ll scale the activations.
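A small sketch of the corresponding test-time forward pass, scaling the activations by p
as in the derivation above (variable names and the relu helper are assumptions):

import numpy as np

def relu(x):
    return np.maximum(0, x)

p = 0.5
x = np.random.randn(1, 100)
W = np.random.randn(100, 100) * 0.01

# Test time: no mask is sampled; the activations are scaled by p so each unit's
# contribution matches its average contribution during training.
h_test = relu(x.dot(W)) * p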

Prof J.C. Kao, UCLA ECE


Dropout

Note: an additional pro of dropout is that there is no additional complexity at test
time. With an ensemble of m separate models, test-time evaluation would scale as O(m).

Prof J.C. Kao, UCLA ECE


Inverted dropout

A common way to implement dropout is inverted dropout, where the scaling by
1/p is done during training. This causes the output to have the same expected
value as if dropout had never been performed.

Thus, testing looks the same irrespective of whether we use dropout or not. See
the code sketch below:
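A minimal numpy sketch of inverted dropout, assuming a keep probability p and a relu
nonlinearity (variable names are illustrative, not from the original slide):

import numpy as np

def relu(x):
    return np.maximum(0, x)

p = 0.5
x = np.random.randn(1, 100)
W = np.random.randn(100, 100) * 0.01

# Training time: drop units and scale by 1/p, so the expected activation equals
# the activation with no dropout at all.
mask = (np.random.rand(1, 100) < p) / p
h_train = relu(x.dot(W)) * mask

# Test time: nothing special, the forward pass is unchanged.
h_test = relu(x.dot(W))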

Prof J.C. Kao, UCLA ECE


Dropout

How is this a good idea?

1) Dropout approximates bagging, since each mask is like a different model. For a
model with N hidden units, there are 2^N different model configurations. Each of
these configurations must be good at predicting the output.

2) You can think of dropout as regularizing each hidden unit to work well in many
different contexts.

3) Dropout may cause units to encode redundant features (e.g., to detect a cat,
there are many things we look for: it's furry, it has pointy ears, it has a tail,
a long body, etc.).

Prof J.C. Kao, UCLA ECE


Lecture summary

Here, we’ve covered tricks that we can do in initialization, regularization, and
data augmentation to improve the performance of neural networks.

But what about the optimizer, stochastic gradient descent? Can we improve
this for deep learning?

That’s the topic of our next lecture.

Prof J.C. Kao, UCLA ECE


Optimization for neural networks

In this lecture, we’ll talk about specific techniques in optimization that aid in
training neural networks.

• Stochastic gradient descent
• Momentum and Nesterov momentum
• Adaptive gradients
• RMSProp
• Adaptive moments (Adam)
• Overview of second order methods (ECE 236B/C)
• Challenges of gradient descent

Prof J.C. Kao, UCLA ECE


Reading

Reading:

Deep Learning, Chapter 8 (intro), 8.1, 8.2, 8.3, 8.4, 8.5, 8.6 (skim)

Prof J.C. Kao, UCLA ECE


Where we are now

At this point, we know:

• Neural network architectures.
• Hyperparameters and cost functions to use for neural networks.
• How to calculate gradients of the loss w.r.t. all parameters in the neural
network (backprop).
• How to initialize the weights and regularize the network in ways to
improve the training of the network.

We do know how to optimize these networks with stochastic gradient descent.
But can it be improved?

In this lecture, we talk about how to make optimization more efficient and
effective.

Prof J.C. Kao, UCLA ECE


Gradient descent

A refresher on gradient descent.


- Cost function: J(θ)

- Parameters: θ

Then, the gradient descent step is:

    θ ← θ − ε ∇_θ J(θ)
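A minimal numpy sketch of this update; the quadratic cost and its gradient are
illustrative assumptions, not from the slides:

import numpy as np

def grad_J(theta):
    # Gradient of the illustrative cost J(theta) = 0.5 * ||theta||^2
    return theta

eps = 0.1                     # learning rate (step size)
theta = np.random.randn(5)    # parameters

for _ in range(100):
    theta = theta - eps * grad_J(theta)   # theta <- theta - eps * grad J(theta)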

Prof J.C. Kao, UCLA ECE


Stochastic gradient descent

(code: computing the softmax loss and its gradient with respect to the weights)
Prof J.C. Kao, UCLA ECE


Stochastic gradient descent

(figures: SGD trajectories on Beale's function)

Prof J.C. Kao, UCLA ECE


Finding the optimal weights through gradient descent

Varying the learning rate:

(figures: trajectories for different learning rates)

Prof J.C. Kao, UCLA ECE


Momentum

Momentum accumulates an exponentially decaying sum of past gradients. With velocity v,
decay α, learning rate ε, and gradient g_t at step t:

    v_t = α v_{t-1} − ε g_t,   v_0 = 0
    θ_t = θ_{t-1} + v_t

Unrolling the velocity:

    v_1 = −ε g_1
    v_2 = α v_1 − ε g_2 = −ε (α g_1 + g_2)
    v_3 = α v_2 − ε g_3 = −ε (α² g_1 + α g_2 + g_3)
    v_4 = α v_3 − ε g_4 = −ε (α³ g_1 + α² g_2 + α g_3 + g_4)

so older gradients are down-weighted by successive powers of α.
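A minimal numpy sketch of the momentum update (the quadratic cost used for the
gradient is an illustrative assumption):

import numpy as np

def grad_J(theta):
    return theta            # gradient of the illustrative cost 0.5 * ||theta||^2

alpha, eps = 0.9, 0.1       # momentum decay and learning rate
theta = np.random.randn(5)
v = np.zeros_like(theta)

for _ in range(100):
    g = grad_J(theta)
    v = alpha * v - eps * g   # v accumulates a decaying sum of past gradients
    theta = theta + v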

Prof J.C. Kao, UCLA ECE




Does momentum help with local optima?

(figure: SGD + momentum trajectory near local optima)

What kind of local optima does momentum tend to find?

Prof J.C. Kao, UCLA ECE


Nesterov momentum

Recall classical momentum:

    v ← α v − ε ∇_θ J(θ)
    θ ← θ + v

Nesterov momentum instead evaluates the gradient at the look-ahead point θ + α v:

    v ← α v − ε ∇_θ J(θ + α v)
    θ ← θ + v
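A sketch of the Nesterov update in numpy, evaluating the gradient at the look-ahead
point (the quadratic cost is again an illustrative assumption):

import numpy as np

def grad_J(theta):
    return theta                   # gradient of 0.5 * ||theta||^2

alpha, eps = 0.9, 0.1
theta = np.random.randn(5)
v = np.zeros_like(theta)

for _ in range(100):
    g = grad_J(theta + alpha * v)  # gradient at the look-ahead point
    v = alpha * v - eps * g
    theta = theta + v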

Prof J.C. Kao, UCLA ECE




Is there a good way to adapt the learning rate?

Annealing: decay the learning rate over the course of training.
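A small sketch of two common annealing schedules (the constants are illustrative
choices, not values from the lecture):

import numpy as np

eps0 = 0.1      # initial learning rate

def step_decay(t, drop=0.5, every=10):
    # Halve the learning rate every `every` epochs.
    return eps0 * (drop ** (t // every))

def exp_decay(t, k=0.05):
    # Smooth exponential decay.
    return eps0 * np.exp(-k * t)

for t in [0, 10, 20, 50]:
    print(t, step_decay(t), exp_decay(t))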

Prof J.C. Kao, UCLA ECE


Adagrad (Duchi et al., 2011)

For parameters θ ∈ R^M with gradient g, Adagrad accumulates the squared gradients and
scales the learning rate element-wise:

    a ← a + g ⊙ g
    θ ← θ − ε g / (√a + δ)

where ⊙ is the element-wise product and δ is a small constant for numerical stability.
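A minimal numpy sketch of the Adagrad update (the quadratic cost and the value of
delta are illustrative assumptions):

import numpy as np

def grad_J(theta):
    return theta                     # gradient of 0.5 * ||theta||^2

eps, delta = 0.1, 1e-7
theta = np.random.randn(5)
a = np.zeros_like(theta)             # running sum of squared gradients

for _ in range(100):
    g = grad_J(theta)
    a += g * g                                  # accumulate elementwise g * g
    theta -= eps * g / (np.sqrt(a) + delta)     # per-parameter effective step size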

Prof J.C. Kao, UCLA ECE




Adagrad

    a ← a + g ⊙ g

Is there a problem with Adagrad? The accumulator a only grows, so the effective step
size ε / (√a + δ) shrinks monotonically as training proceeds.

Prof J.C. Kao, UCLA ECE


RMSProp
RMSProp replaces Adagrad's running sum with an exponentially decaying average of the
squared gradients, so old gradients are eventually forgotten:

    a ← ρ a + (1 − ρ) g ⊙ g    (e.g., ρ = 0.99 or 0.999, giving a ← 0.99 a + 0.01 g ⊙ g)
    θ ← θ − ε g / (√a + δ)
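A minimal numpy sketch of the RMSProp update with ρ = 0.99 (the quadratic cost and the
value of delta are illustrative assumptions):

import numpy as np

def grad_J(theta):
    return theta                     # gradient of 0.5 * ||theta||^2

eps, rho, delta = 0.01, 0.99, 1e-7
theta = np.random.randn(5)
a = np.zeros_like(theta)             # decaying average of squared gradients

for _ in range(100):
    g = grad_J(theta)
    a = rho * a + (1 - rho) * g * g
    theta -= eps * g / (np.sqrt(a) + delta)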

Prof J.C. Kao, UCLA ECE




RMSProp + momentum
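One common way to combine the two is to fold the RMSProp-scaled gradient into a
momentum velocity. This particular combination is a hedged sketch of what the slide
likely showed, not taken from it; the cost and hyperparameter values are illustrative:

import numpy as np

def grad_J(theta):
    return theta                     # gradient of 0.5 * ||theta||^2

eps, rho, alpha, delta = 0.01, 0.99, 0.9, 1e-7
theta = np.random.randn(5)
a = np.zeros_like(theta)             # decaying average of squared gradients
v = np.zeros_like(theta)             # momentum velocity

for _ in range(100):
    g = grad_J(theta)
    a = rho * a + (1 - rho) * g * g
    v = alpha * v - eps * g / (np.sqrt(a) + delta)  # momentum on the scaled gradient
    theta += v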

Prof J.C. Kao, UCLA ECE




Adam

Adam combines momentum (a 1st moment estimate of the gradient) with RMSProp-style
scaling (a 2nd moment estimate of the squared gradient), plus bias correction of
both estimates.
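A minimal numpy sketch of Adam with bias correction; the hyperparameter values are the
common defaults, assumed here rather than taken from the slides:

import numpy as np

def grad_J(theta):
    return theta                      # gradient of 0.5 * ||theta||^2

eps, beta1, beta2, delta = 1e-3, 0.9, 0.999, 1e-8
theta = np.random.randn(5)
m = np.zeros_like(theta)              # 1st moment estimate
v = np.zeros_like(theta)              # 2nd moment estimate

for t in range(1, 101):
    g = grad_J(theta)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)      # bias correction (moments start at zero)
    v_hat = v / (1 - beta2 ** t)
    theta -= eps * m_hat / (np.sqrt(v_hat) + delta)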

Prof J.C. Kao, UCLA ECE


Adam with no bias correction

Prof J.C. Kao, UCLA ECE




First order methods

First-order methods rely on a first-order Taylor expansion of the cost around θ_t:

    J(θ) ≈ J(θ_t) + (θ − θ_t)^T ∇_θ J(θ_t)

Taking the gradient descent step θ_{t+1} = θ_t − ε g, where g = ∇_θ J(θ_t):

    J(θ_{t+1}) ≈ J(θ_t) + (θ_{t+1} − θ_t)^T g = J(θ_t) − ε g^T g

Since g^T g ≥ 0, the cost decreases for a sufficiently small step size ε > 0.

Prof J.C. Kao, UCLA ECE


Second order methods

(figure: two cost curves J(θ), one where a large step is appropriate and one where
only small steps are safe)

Prof J.C. Kao, UCLA ECE


Newton’s method
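As a small sketch of a single Newton step, θ ← θ − H⁻¹ ∇_θ J(θ), on an illustrative
quadratic cost (the matrix A and starting point are made-up assumptions):

import numpy as np

A = np.array([[3.0, 0.5], [0.5, 1.0]])   # SPD matrix: J(theta) = 0.5 * theta^T A theta
theta = np.array([1.0, -2.0])

g = A.dot(theta)                          # gradient of the quadratic cost
H = A                                     # Hessian of the quadratic cost
theta = theta - np.linalg.solve(H, g)     # Newton step lands at the minimum (zero)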

Prof J.C. Kao, UCLA ECE


Newton’s method NOT TESTED

Prof J.C. Kao, UCLA ECE


Quasi-Newton methods NOT TESTED

Prof J.C. Kao, UCLA ECE


Conjugate gradients NOT TESTED

Prof J.C. Kao, UCLA ECE


Challenges in gradient descent

Prof J.C. Kao, UCLA ECE


