9 Sqoop Notes
[Diagram: the RL loop (generate samples, i.e. run the policy; fit a model to estimate return), shown alongside supervised learning on training data.]
Example: Gaussian policies
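A worked sketch of this case (assuming, as usual for this example, a network mean f_θ(s_t) and a fixed covariance Σ):
\[
\pi_\theta(a_t \mid s_t) = \mathcal{N}\big(f_\theta(s_t),\, \Sigma\big), \qquad
\log \pi_\theta(a_t \mid s_t) = -\tfrac{1}{2}\big(f_\theta(s_t) - a_t\big)^{\top} \Sigma^{-1} \big(f_\theta(s_t) - a_t\big) + \text{const}
\]
The policy gradient then follows by backpropagating this log-likelihood through the network that produces f_θ(s_t).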
What did we just do?
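In symbols, the estimator just constructed is the standard REINFORCE-style form (a sketch):
\[
\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \left( \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) \right) \left( \sum_{t=1}^{T} r(s_{i,t}, a_{i,t}) \right),
\qquad \theta \leftarrow \theta + \alpha \, \nabla_\theta J(\theta)
\]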
What is wrong with policy gradient?
• High variance
• Slow convergence
• Hard to choose the learning rate
Partial observability
• Works just fine
Break
Reducing variance
“reward to go”
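Sketch of the reward-to-go estimator: each log-probability term is weighted only by the rewards from time t onward, rather than by the whole trajectory's return:
\[
\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) \, \hat{Q}_{i,t},
\qquad
\hat{Q}_{i,t} = \sum_{t'=t}^{T} r(s_{i,t'}, a_{i,t'})
\]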
a convenient identity
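Written out, the identity in question (used below to show that baselines do not bias the estimator):
\[
\pi_\theta(\tau) \, \nabla_\theta \log \pi_\theta(\tau) = \pi_\theta(\tau) \, \frac{\nabla_\theta \pi_\theta(\tau)}{\pi_\theta(\tau)} = \nabla_\theta \pi_\theta(\tau)
\]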
Baselines
average reward is not the best baseline, but it’s pretty good!
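Concretely, subtracting a constant baseline b (e.g. the average return) and checking that the extra term has zero expectation (a sketch):
\[
\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \nabla_\theta \log \pi_\theta(\tau_i) \, \big( r(\tau_i) - b \big),
\qquad
b = \frac{1}{N} \sum_{i=1}^{N} r(\tau_i)
\]
\[
E\big[ \nabla_\theta \log \pi_\theta(\tau) \, b \big]
= \int \pi_\theta(\tau) \, \nabla_\theta \log \pi_\theta(\tau) \, b \, d\tau
= \int \nabla_\theta \pi_\theta(\tau) \, b \, d\tau
= b \, \nabla_\theta \int \pi_\theta(\tau) \, d\tau
= b \, \nabla_\theta 1 = 0
\]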
Analyzing variance
• Baselines: unbiased!
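For reference, minimizing the variance over b gives the usual optimal-baseline expression (sketched per gradient component, where g(τ) denotes the gradient of log π_θ(τ); only the first term below depends on b, since the baseline leaves the expectation unchanged):
\[
\mathrm{Var} = E_{\tau \sim \pi_\theta}\!\big[ g(\tau)^2 \big( r(\tau) - b \big)^2 \big] - E\big[ g(\tau) \, r(\tau) \big]^2,
\qquad
\frac{d \, \mathrm{Var}}{d b} = 0 \;\Rightarrow\; b = \frac{E\big[ g(\tau)^2 \, r(\tau) \big]}{E\big[ g(\tau)^2 \big]}
\]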
Policy gradient with automatic differentiation
Pseudocode example (with discrete actions):
Maximum likelihood:
# Given:
# actions - (N*T) x Da tensor of actions
# states - (N*T) x Ds tensor of states
# Build the graph:
logits = policy.predictions(states) # This should return (N*T) x Da tensor of action logits
negative_likelihoods = tf.nn.softmax_cross_entropy_with_logits(labels=actions, logits=logits)
loss = tf.reduce_mean(negative_likelihoods)
gradients = tf.gradients(loss, variables) # variables: the policy's trainable parameters
Policy gradient:
# Given:
# actions - (N*T) x Da tensor of actions
# states - (N*T) x Ds tensor of states
# q_values – (N*T) x 1 tensor of estimated state-action values
# Build the graph:
logits = policy.predictions(states) # This should return (N*T) x Da tensor of action logits
negative_likelihoods = tf.nn.softmax_cross_entropy_with_logits(labels=actions, logits=logits)
weighted_negative_likelihoods = tf.multiply(negative_likelihoods, tf.squeeze(q_values, axis=1)) # squeeze (N*T) x 1 to (N*T) so the multiply is elementwise
loss = tf.reduce_mean(weighted_negative_likelihoods)
gradients = tf.gradients(loss, variables) # variables: the policy's trainable parameters
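For continuous actions (e.g. the Gaussian policies above), a hedged sketch of the same weighted-likelihood trick; the names policy.predictions, log_std, Da, and variables are assumptions mirroring the discrete pseudocode, not a prescribed API:
# Given (assumed, mirroring the discrete case):
# actions - (N*T) x Da tensor of continuous actions
# states - (N*T) x Ds tensor of states
# q_values - (N*T) x 1 tensor of estimated state-action values
# Build the graph:
mean = policy.predictions(states) # assumed to return (N*T) x Da tensor of action means
log_std = tf.get_variable("log_std", shape=[Da]) # state-independent log std (an assumption)
dist = tf.distributions.Normal(loc=mean, scale=tf.exp(log_std))
negative_likelihoods = -tf.reduce_sum(dist.log_prob(actions), axis=1) # (N*T,) negative log-probs
weighted_negative_likelihoods = tf.multiply(negative_likelihoods, tf.squeeze(q_values, axis=1))
loss = tf.reduce_mean(weighted_negative_likelihoods)
gradients = tf.gradients(loss, variables)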
Policy gradient in practice
• Remember that the gradient has high variance
• This isn’t the same as supervised learning!
• Gradients will be really noisy!
• Consider using much larger batches
• Tweaking learning rates is very hard
• Adaptive step size rules like Adam can be OK-ish
• We’ll learn about policy-gradient-specific learning rate adjustment methods later!
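A minimal sketch of wiring this up with an adaptive optimizer (the learning rate here is purely illustrative; assumes the loss from the pseudocode above, TF 1.x style):
optimizer = tf.train.AdamOptimizer(learning_rate=1e-3) # adaptive step size; still expect to tune it
train_op = optimizer.minimize(loss)
# Feed much larger batches (many full trajectories per update) than typical
# supervised learning, since the gradient estimate is very noisy.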
Review
• Policy gradient is on-policy
• Can derive off-policy variant
• Use importance sampling (see the sketch after this list)
• Exponential scaling in T
• Can ignore state portion (approximation)
• Can implement with automatic differentiation (need to know what to backpropagate)
• Practical considerations: batch size, learning rates, optimizers
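A sketch of the off-policy form via importance sampling, and why it scales badly in T (the per-timestep ratios multiply):
\[
\nabla_{\theta'} J(\theta') = E_{\tau \sim \pi_\theta(\tau)}\!\left[ \frac{\pi_{\theta'}(\tau)}{\pi_\theta(\tau)} \, \nabla_{\theta'} \log \pi_{\theta'}(\tau) \, r(\tau) \right],
\qquad
\frac{\pi_{\theta'}(\tau)}{\pi_\theta(\tau)} = \prod_{t=1}^{T} \frac{\pi_{\theta'}(a_t \mid s_t)}{\pi_\theta(a_t \mid s_t)}
\]
The product of T ratios is what makes the importance weight's variance grow exponentially with T.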
Advanced policy gradient topics
• Incorporate example demonstrations using importance sampling
• Neural network policies