Lecture 14: Monte Carlo Tree Search
Emma Brunskill
CS234 Reinforcement Learning.
With many slides from or derived from David Silver
Refresh Your Understanding
Select all that are true:
Upper confidence bounds are used to balance exploration and
leveraging the acquired information to achieve high reward
These algorithms can be used in bandits and Markov decision
processes
If the reward model is known, there is no benefit to using an upper
confidence bound algorithm
Refresh Your Understanding
Select all that are true:
Upper confidence bounds are used to balance exploration and
leveraging the acquired information to achieve high reward
These algorithms can be used in bandits and Markov decision
processes
If the reward model is known, there is no benefit to using an upper
confidence bound algorithm
True. True. Depends on setting. In bandits, no additional gain. In RL, if
the dynamics model is not known, there will be a gain.
Class Structure
Last time: Fast / sample efficient Reinforcement Learning
This Time: MCTS
Next time: Rewards in RL
AlphaZero and Monte Carlo Tree Search
Responsible in part for one of the greatest achievements in AI in the last decade: becoming a better Go player than any human
Incorporates a number of interesting ideas
Table of Contents
1 Simulation-Based Search
2 AlphaZero
Computing Action for Current State Only
So far in class, we have computed a policy for the whole state space
Key idea: we can prioritize some additional local computation to make a better decision right now
Simple Monte-Carlo Search
Given a model $M_\nu$ and a simulation policy π
For each action a ∈ A
Simulate K episodes from the current (real) state $s_t$:
$\{s_t, a, R_{t+1}^k, \ldots, S_T^k\}_{k=1}^K \sim M_\nu, \pi$
Evaluate actions by mean return (Monte-Carlo evaluation):
$Q(s_t, a) = \frac{1}{K}\sum_{k=1}^{K} G_t \xrightarrow{P} q_\pi(s_t, a) \qquad (1)$
Select current (real) action with maximum value
$a_t = \arg\max_{a \in A} Q(s_t, a)$
This is essentially doing one step of policy improvement (a code sketch follows below)
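A minimal Python sketch of the simple Monte-Carlo search procedure above. The model interface (model.sample_episode) and the simulation policy object are illustrative assumptions, not from the slides.

import numpy as np

def simple_mc_search(model, sim_policy, s_t, actions, K, gamma=1.0):
    """For each action, simulate K episodes from s_t under sim_policy,
    average the returns, and act greedily with respect to the estimates."""
    q = {}
    for a in actions:
        returns = []
        for _ in range(K):
            # Assumed interface: list of rewards obtained by taking a in s_t
            # and then following sim_policy in the model until termination.
            rewards = model.sample_episode(s_t, a, sim_policy)
            returns.append(sum(gamma ** i * r for i, r in enumerate(rewards)))
        q[a] = np.mean(returns)        # Monte-Carlo estimate of q_pi(s_t, a)
    return max(q, key=q.get)           # a_t = argmax_a Q(s_t, a)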
Simulation-Based Search
Simulate episodes of experience from now with the model
Apply model-free RL to simulated episodes
Expectimax Tree
Can we do better than one step of policy improvement?
If we have an MDP model $M_\nu$
We can compute optimal Q(s, a) values for the current state by constructing an expectimax tree
Forward Search Expectimax Tree
Forward search algorithms select the best action by lookahead
They build a search tree with the current state st at the root
Using a model of the MDP to look ahead
No need to solve whole MDP, just sub-MDP starting from now
Expectimax Tree
Can we do better than one step of policy improvement?
If we have an MDP model $M_\nu$
We can compute optimal q(s, a) values for the current state by constructing an expectimax tree
Limitation: the size of the tree scales as $(|S|\,|A|)^H$ for horizon H
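As a rough illustration with hypothetical numbers: for $|S| = 10$, $|A| = 5$, and horizon $H = 10$, the expectimax tree already contains on the order of $(|S|\,|A|)^H = 50^{10} \approx 9.8 \times 10^{16}$ nodes, which is why sampling-based search is needed.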
Monte-Carlo Tree Search (MCTS)
Given a model $M_\nu$
Build a search tree rooted at the current state st
Samples actions and next states
Iteratively construct and update tree by performing K simulation
episodes starting from the root state
After search is finished, select current (real) action with maximum
value in search tree
$a_t = \arg\max_{a \in A} Q(s_t, a)$
Check your understanding: How does this differ from simple Monte-Carlo search?
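A minimal sketch of the MCTS loop described above, assuming a simple node structure, a sampling model interface (model.step), and a rollout policy; none of these names come from the slides. Selection here is uniformly random to keep the sketch short; UCT replaces select_action with the upper-confidence rule shown later.

import random

class Node:
    """One search-tree node: per-action visit counts and mean-return estimates."""
    def __init__(self, state, actions):
        self.state = state
        self.N = {a: 0 for a in actions}      # visit counts N(i, a)
        self.Q = {a: 0.0 for a in actions}    # Monte-Carlo value estimates
        self.children = {}                    # (action, next_state) -> Node

def select_action(node):
    # Placeholder simulation policy at internal nodes; UCT swaps in a UCB rule.
    return random.choice(list(node.N.keys()))

def rollout(model, s, policy, gamma):
    """Simulate to termination with a fixed rollout policy; return the discounted return."""
    G, discount, done = 0.0, 1.0, False
    while not done:
        s, r, done = model.step(s, policy(s))
        G += discount * r
        discount *= gamma
    return G

def run_mcts(model, root, actions, K, rollout_policy, gamma=1.0):
    """Run K simulated episodes from the root, growing the tree by one leaf per
    episode, then return the root action with the highest estimated value."""
    for _ in range(K):
        node, path = root, []
        while True:                                       # selection
            a = select_action(node)
            s2, r, done = model.step(node.state, a)       # assumed sampling interface
            path.append((node, a, r))
            child = node.children.get((a, s2))
            if child is None or done:
                if not done:                              # expansion: add one leaf
                    node.children[(a, s2)] = Node(s2, actions)
                break
            node = child
        G = 0.0 if done else rollout(model, s2, rollout_policy, gamma)
        for node, a, r in reversed(path):                 # backup along the path
            G = r + gamma * G
            node.N[a] += 1
            node.Q[a] += (G - node.Q[a]) / node.N[a]
    return max(root.Q, key=root.Q.get)                    # a_t = argmax_a Q(s_t, a)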
Check Your Understanding: MCTS
MCTS involves deciding on an action to take by doing tree search
where it picks actions to maximize Q(S, A) and samples states.
Select all
1 Given a MDP, MCTS may be a good choice for short horizon problems
with a small number of states and actions.
2 Given a MDP, MCTS may be a good choice for long horizon problems
with a large action space and a small state space
3 Given a MDP, MCTS may be a good choice for long horizon problems
with a large state space and small action space
4 Not sure
False. False. True.
Upper Confidence Tree (UCT) Search
How to select what action to take during a simulated episode?
Upper Confidence Tree (UCT) Search
How to select what action to take during a simulated episode?
UCT: borrow an idea from the bandit literature and treat each node where actions can be selected as a multi-armed bandit (MAB) problem
Maintain an upper confidence bound over reward of each arm
Upper Confidence Tree (UCT) Search
How to select what action to take during a simulated episode?
UCT: borrow an idea from the bandit literature and treat each node where actions can be selected as a multi-armed bandit (MAB) problem
Maintain an upper confidence bound over the reward of each arm at a node:
$Q(s, a, i) = \frac{1}{N(i, a)}\sum_{k=1}^{N(i, a)} G_k(i, a) + c\sqrt{\frac{\log N(i)}{N(i, a)}}$
where N(i, a) is the number of times arm a has been selected at node i, N(i) is the total number of simulations through node i, and $G_k(i, a)$ is the k-th return (discounted sum of rewards) observed from node i after taking action a
For simulated episode k at node i, select the action/arm with the highest upper bound to simulate and expand (or evaluate) in the tree:
$a_{ik} = \arg\max_a Q(s, a, i)$
This implies that the policy used to simulate episodes (and to expand/update the tree) can change across episodes; a code sketch of this selection rule follows below
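A sketch of the UCT selection rule above, written as a drop-in replacement for the random select_action in the earlier MCTS sketch; the node fields and the constant c (set to roughly sqrt(2) here) are assumptions.

import math

def uct_select_action(node, c=1.4):
    """Pick the arm maximizing Q(s, a, i) + c * sqrt(log N(i) / N(i, a));
    unvisited arms get an infinite bound so each is tried at least once."""
    total = sum(node.N.values())                  # N(i): total visits through node i
    def ucb(a):
        if node.N[a] == 0:
            return float('inf')
        return node.Q[a] + c * math.sqrt(math.log(total) / node.N[a])
    return max(node.N.keys(), key=ucb)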
Advantages of MC Tree Search
Highly selective best-first search
Evaluates states dynamically (unlike e.g. DP)
Uses sampling to break curse of dimensionality
Works for “black-box” models (only requires samples)
Computationally efficient, anytime, parallelisable
Table of Contents
1 Simulation-Based Search
2 AlphaZero
AlphaGo
AlphaGo trailer link
Case Study: the Game of Go
Go is 2500 years old
Hardest classic board game
Grand challenge task (John
McCarthy)
Traditional game-tree search
has failed in Go
Check your understanding:
does playing Go involve
learning to make decisions in
a world where dynamics and
reward model are unknown?
Rules of Go
Usually played on 19x19, also 13x13 or 9x9 board
Simple rules, complex strategy
Black and white place down stones alternately
Surrounded stones are captured and removed
The player with more territory wins the game
AlphaGo and AlphaZero
Self Play
Strategic Computation
Highly selective best-first search
Power of Averaging
Local Computation
Learn and Update Heuristics
Self Play for Go
Key idea: have agent play itself
The game proceeds by computing the best move at the current state, then doing the same for the opponent's move
Bottleneck is only computation, no humans needed
Self-play also provides a well-matched player
Check your understanding: how does this help with policy training?
What is the reward density?
Self Play for Go: Solution
Key idea: have agent play itself
The game proceeds by computing the best move at the current state, then doing the same for the opponent's move
Bottleneck is only computation, no humans needed
Self-play also provides a well-matched player
Check your understanding: how does this help with policy training?
What is the reward density?
Rewards will be quite dense as both players are evenly matched. This
provides a form of curriculum learning.
Selecting a Move in a Single Game: Start at Root
Inspired by Upper Confidence Tree Search but many changes
Images from Silver et al., Nature 2017
Selecting a Move in a Single Game: Repeatedly Expand
Selecting a Move in a Single Game: Note Using Network
Predictions for Action Probabilities
Selecting a Move in a Single Game: At Leaf, Plug in
Network Predictions for Value
Selecting a Move in a Single Game: Update Ancestors
Images from Silver et al., Nature 2017
Selecting a Move in a Single Game: Repeat Many Times
Repeat the rollout and backup process many times
Note: within the tree, levels alternate between the agent and the opponent each "maximizing" its own value, so the tree mimics a minimax tree
At the end, compute a policy for the root node by
$\pi(a \mid s) \propto N(s, a)^{1/\tau} \qquad (2)$
(a small code sketch follows below)
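A small sketch of computing the root policy from visit counts as in equation (2); the visit counts and temperature value below are hypothetical.

import numpy as np

def root_policy(visit_counts, tau=1.0):
    """pi(a|s) proportional to N(s, a)^(1/tau); small tau approaches greedy selection."""
    counts = np.array(list(visit_counts.values()), dtype=float)
    weights = counts ** (1.0 / tau)
    probs = weights / weights.sum()
    return dict(zip(visit_counts.keys(), probs))

# Example with hypothetical counts: root_policy({'a1': 30, 'a2': 60, 'a3': 10})
# returns {'a1': 0.3, 'a2': 0.6, 'a3': 0.1} when tau = 1.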
Images from Silver et al., Nature 2017
Self Play a Game
Select an action according to root policy, take action, and repeat
whole process
Repeat until the game ends and observe a win or loss
Images from Silver et al., Nature 2017
Train Neural Network to Predict Policies and Values
Images from Silver et al., Nature 2017
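A minimal sketch of the AlphaZero-style training objective described in Silver et al. (2017): the value head is regressed toward the self-play outcome z, the policy head toward the MCTS visit distribution, with L2 regularization. The function name, array shapes, and the weight c are assumptions here, not from the slides.

import numpy as np

def alphazero_loss(z, v, pi, log_p, theta, c=1e-4):
    """(z - v)^2 - pi^T log p + c * ||theta||^2, averaged over a batch.
    z: game outcomes, v: predicted values, pi: MCTS visit distributions,
    log_p: predicted log action probabilities, theta: flattened network parameters."""
    value_loss = np.mean((z - v) ** 2)
    policy_loss = -np.mean(np.sum(pi * log_p, axis=1))
    l2_penalty = c * np.sum(theta ** 2)
    return value_loss + policy_loss + l2_penalty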
AlphaGo and AlphaZero: Recap and Evaluation
Features:
Self Play
Strategic Computation
Highly selective best-first search
Power of Averaging
Local Computation
Learn and Update Heuristics
Evaluation Questions
What is the influence of architecture?
What is the impact of using MCTS (on top of learning a policy / value
function)?
How does it compare to human play or using human play?
Impact of Architecture
Images from Silver et al., Nature 2017
Impact of MCTS
Images from Silver et al., Nature 2017
Overall performance
Images from Silver et al., Nature 2017
Need for Human Data?
Images from Silver et al., Nature 2017
In more depth: Upper Confidence Tree (UCT) Search
UCT: borrow an idea from the bandit literature and treat each tree node where actions can be selected as a multi-armed bandit (MAB) problem
Maintain an upper confidence bound over reward of each arm and
select the best arm
Check your understanding: Why is this slightly strange? Hint: why were upper confidence bounds a good idea for exploration/exploitation? Is there an exploration/exploitation problem during simulated episodes?
Relates to metalevel reasoning (for an example related to Go, see "Selecting Computations: Theory and Applications", Hay, Russell, Tolpin, and Shimony 2012)
Check Your Understanding: UCT Search
In Upper Confidence Tree (UCT) search we treat each tree node as a
multi-armed bandit (MAB) problem, and use an upper confidence
bound over the future value of each action to help select actions for
later rollouts. Select all that are true
1 This may be useful since it will prioritize actions that lead to later good
rewards
2 UCB minimizes regret. UCT is minimizing regret within rollouts of the tree. (If this is true, think about whether this is a good idea.)
3 Not sure
True. True (but not a good idea).
In more depth: Upper Confidence Tree (UCT) Search
UCT: borrow an idea from the bandit literature and treat each tree node where actions can be selected as a multi-armed bandit (MAB) problem
Maintain an upper confidence bound over the reward of each arm and select the best arm
Hint: why were upper confidence bounds a good idea for exploration/exploitation? Is there an exploration/exploitation problem during simulated episodes?
Relates to metalevel reasoning (for an example related to Go, see "Selecting Computations: Theory and Applications", Hay, Russell, Tolpin, and Shimony 2012)
Class Structure
Last time: Fast Learning
This Time: MCTS
Next time: Rewards in RL
Lecture 15 Rewards in RL
Emma Brunskill
CS234 Reinforcement Learning.
Refresh Your Understanding
Select all that are true:
Direct Preference Optimization assumes human preferences follow a
Bradley Terry model
RLHF can be used with reward models learned from preferences or
reward models learned from people labeling rewards
Asking people to provide preference pair rankings is likely to be an
efficient way to learn the reward model for board games
DPO and RLHF can be used with extremely large policy networks
Not sure
Refresh Your Understanding
Select all that are true:
Direct Preference Optimization assumes human preferences follow a
Bradley Terry model
RLHF can be used with reward models learned from preferences or
reward models learned from people labeling rewards
Asking people to provide preference pair rankings is likely to be an
efficient way to learn the reward model for board games
DPO and RLHF can be used with extremely large policy networks
Not sure
True. True. False. True.
Class Structure
Last time: MCTS
Today: Rewards in RL
Refresh Your Understanding 2
Select all that are true:
Monte Carlo Tree Search approximates a forward search tree
MCTS tackles the action branching factor through sampling
AlphaZero uses two networks, one to help prioritize across actions,
and one to provide an estimate of the value at leaves
Doing additional guided Monte Carlo tree search when computing an
action significantly improved the test time performance of AlphaZero
Self play provides a form of curriculum learning
Not sure
Refresh Your Understanding 2
Select all that are true:
Monte Carlo Tree Search approximates a forward search tree
MCTS tackles the action branching factor through sampling
AlphaZero uses two networks, one to help prioritize across actions,
and one to provide an estimate of the value at leaves
Doing additional guided Monte Carlo tree search when computing an
action significantly improved the test time performance of AlphaZero
Self play provides a form of curriculum learning
Not sure
True. False. False. True. True
Selecting a Move in a Single Game: Repeatedly Expand
Selecting a Move in a Single Game: Note Using Network Predictions for Action Probabilities
Value Alignment
Dan Webber, PhD
Wait, who’s this “Dan” guy?
• Postdoc, HAI and EIS at Stanford
• Embedding ethics into CS courses like this one!
• PhD in Philosophy, University of Pittsburgh
• Dissertation on moral theory
• Basically, trying to think systematically about value
• BA in Computer Science, Amherst College
• Plus a few years as a software developer in fintech and e-commerce
Value (mis)alignment: an example
Paperclip AI (Bostrom 2016): “An AI, designed to manage
production in a factory, is given the final goal of maximizing the
manufacture of paperclips…”
“… and proceeds by converting first the Earth and then
increasingly large chunks of the observable universe into
paperclips.”
Even a less powerful AI might pursue this goal in surprising
ways.
Value alignment: the problem
How do we design AI agents that will do what we really want?
What we really want is often much more nuanced than what we
say we want. Humans work with many background assumptions
that are (1) hard to formalize and (2) easy to take for granted.
It’s hard to solve this problem just by giving better instructions!
• Compare the difficulty in manually specifying reward functions.
• Even worse for AI that takes instructions from non-expert users!
Precisifying the problem
There are several ways of interpreting “what we really want”!
First, value alignment might be the problem of designing AI
agents that do what we really intend for them to do.
If this is right, Paperclip AI is an example of value misalignment
because the AI failed to derive the user’s true intention
(maximize production subject to certain constraints) from their
instruction (maximize production).
Aligning to user intentions
The solution, then, would be to design AI systems that
successfully translate from underspecified instructions to fully
specified intentions (incl. unspoken constraints, conditions, etc.)
“This is a significant challenge. To really grasp the intention
behind instructions, AI may require a complete model of human
language and interaction, including an understanding of the
culture, institutions, and practices that allow people to
understand the implied meaning of terms.” (Gabriel 2020)
Aligning to user intentions
A philosophical problem: our intentions don’t always track what
we really want.
Classic cases: incomplete information, imperfect rationality
Suppose I intend for the AI to maximize paperclip production
(subject to constraints) because I want to maximize return on
my investment in the factory. If the AI knows that I would get a
better return by producing something else, has it given me
what I really want if it does what I intend?
Aligning to revealed preferences
Second interpretation: AI agent is value-aligned if it does what
the user prefers.
• Paperclip AI is misaligned because I prefer it not destroy the world!
Problem: How can the AI know what the user prefers when that
differs from the intentions expressed by the user?
Solution: The AI could infer the user’s preferences from the
user’s behavior or feedback.
Aligning to revealed preferences
Technical challenges:
• Requires agent to train on observation of user or from user feedback
• Infinitely many preference/reward functions consistent with finite
behavior/feedback
• Hard to infer preferences about unexpected situations (e.g.,
emergencies)
Philosophical problem:
• Just as my intentions can diverge from my preferences, my
preferences can diverge from what is actually good for me.
Aligning to user’s best interests
Third interpretation: AI agent is value-aligned if it does what is
in the user’s best interests, objectively speaking.
• Paperclip AI is misaligned because it is objectively bad for me for the
world to be destroyed.
Technical/philosophical problem: Unlike the intended meaning
of my instruction or my revealed preferences, my objective best
interests can’t be determined empirically. What’s objectively
good for me is a philosophical question, not a scientific one.
Aligning to user’s best interests
The bad news is that philosophers disagree about what’s
objectively good for a person:
• Is it just the person’s own pleasure or happiness?
• … or the satisfaction of the person’s desires or preferences?
• … or are things like health, safety, knowledge, relationships, etc.
objectively good for us even if we don’t enjoy or prefer them?
The good news is that there’s a lot of agreement:
• Health, safety, liberty, knowledge, social relationships, purpose,
dignity, happiness… almost everyone agrees that these things are at
least usually good for the person who has them
Aligning to user’s best interests
One thing that is widely thought to be good for a person is
autonomy: the ability to choose for yourself how to live your
life, even if you don’t always make the best choice.
We want to avoid paternalism: choosing what you think is best
for someone rather than letting her choose for herself.
Even if we align to users’ best interests, then, users’ interests in
autonomy might give us reason to consider their intentions or
preferences, even when these conflict with their other interests.
Part-way recap
Value alignment is the problem of designing AI agents that will
do what we really want them to do.
This could mean doing what we really intend, or what we really
prefer, or what would really be in our best interest.
These are not always the same thing, and each option poses
unique technical and philosophical problems for alignment.
Case study: LLM chatbot personalization
Everyone who talks to ChatGPT is talking to the same chatbot.
But many chatbot providers now offer a wide range of different
chatbots with different personas. Often these personas are
crafted by users:
Case study: LLM chatbot personalization
Imagine you are building an LLM chatbot to serve as a source of
news for users.
• In what ways might you make the chatbot personalizable if you
wanted to align to users’ preferences?
• In what ways might you make the chatbot personalizable if you
wanted to align to users’ best interests?
• What would be the pros and cons of each approach?
Discuss!
WHAT'S BEEN MISSING FROM OUR DISCUSSION SO FAR?
PEOPLE OTHER THAN THE USER!
Aligning to morality
Fourth interpretation: AI agent is value-aligned if it does what is
morally right.
• Paperclip AI is misaligned because it’s bad for everyone if the world is
destroyed!
This interpretation emphasizes the we in “what we really want.”
What the user intends, prefers, or even what’s in her interest
might be bad for others!
Aligning to morality
But it wasn’t just a waste of time to start by focusing on the user!
Even though we want to align to morality, we also want to align
to what the user wants when what the user wants is morally
acceptable.
So it still matters how we think about what the user really wants,
even if we need to think about it in the larger moral context.
Aligning to morality
A philosophical problem: Which things really are morally right?
There’s a lot of disagreement on this one too!
• Is it right to lie to spare someone’s feelings?
• Is it right to pirate copyrighted material?
• Is it right to buy luxuries when you could donate to charity instead?
• Is it right to kill one person to save five?
… a thousand?
… a million?
Aligning to the best moral theory
A moral theory is a systematic account of morality that aims to
answer questions like these.
• For example, consequentialism says that an act is right iff it produces
the greatest net good of any act available.
• Is it right to lie to spare someone’s feelings? It could be, if you can get
away with it!
Idea: align AI agents to the correct or best moral theory.
• If that theory were consequentialism, value alignment would be
about training AI agents to do what they can to maximize net good.
Aligning to the best moral theory
A now familiar problem: philosophers disagree about what the
best moral theory is, or even if there is one!
Prioritarianism: Produce the greatest weighted sum of good,
with the interests of those who are worse off given more weight.
Maximin: Make things as good as possible for the worst off.
Aligning to the best moral theory
Satisficing: Produce a sufficiently great (weighted) sum of good.
Deontology: Even acts with good consequences can be wrong
if they violate certain moral rules or rights.
• Common deontological rules: don’t murder anyone, don’t steal, don’t
lie, keep your promises, etc.
• Rules/rights themselves might be justified by their consequences!
Aligning to the best moral theory
Another problem: even if we knew the best moral theory, it
might be bad to design AI agents to act on moral values that
their users don’t share.
This might be bad for moral reasons (avoiding paternalism) or
more practical ones (users might not trust AI agent, might try to
oppose it, etc.)
Aligning to common-sense morality
There’s lots of moral disagreement, but also a lot of agreement.
Idea #2: align AI agents to what we might call common-sense
morality: the common-sense moral ideas that most people
agree on.
Instead of trying to make AI morally perfect, we just try to make
it make moral decisions like a regular person would.
• This probably ends up being pretty deontological and satisficing!
Common-sense morality and predictability
One advantage of aligning to common-sense morality rather
than a particular moral theory is that moral theories often have
surprising implications.
• Can you think of a surprising implication of the consequentialist
requirement to maximize net good?
• How about a deontological rule against lying?
An AI agent aligned to a particular moral theory might act on
surprising implications we haven’t even noticed yet!
Common-sense morality and predictability
By contrast, an AI agent aligned to common-sense morality
would behave more predictably, making moral decisions like a
regular human.
But it might be unpredictable in edge cases, where common
sense arguably runs out.
• Would an AI aligned to common-sense morality kill one person to
save a million?
Is it bad if the AI is unsure about cases we’re also unsure about?
Recap
Value alignment is the problem of designing AI agents that will
do what we really want them to do.
This could mean doing what we really intend, or what we really
prefer, or what would really be in our best interest.
• We saw what the difference might look like in chatbot development.
It definitely means doing what’s morally right.
• But this could mean doing what’s morally right in theory, or just
abiding by the common-sense morality that most people share.
Reinforcement Learning
Learning through experience/data to make good decisions under
uncertainty
High Level Learning Goals
Define the key features of RL
Given an application problem know how (and whether) to use RL for it
Implement (in code) common RL algorithms
Describe (list and define) multiple criteria for analyzing RL algorithms and
evaluate algorithms on these metrics: e.g. regret, sample complexity,
computational complexity, empirical performance, convergence, etc.
Describe the exploration vs exploitation challenge and compare and contrast
at least two approaches for addressing this challenge (in terms of
performance, scalability, complexity of implementation, and theoretical
guarantees)
For more detailed descriptions, see website
Revisiting Motivating Domains from First Lecture
CYU: Answer For One of These Domains
Which domain are you choosing?
Is this problem a bandit? A multi-step RL problem?
Is the problem online / offline or some combination?
What might the state / action / rewards be?
What algorithms might be useful here?
AlphaTensor. Fawzi et al. 2022
Revisiting: Learning Plasma Control for Fusion Science
Image credits: DeepMind & SPC/EPFL. Degrave et al., Nature 2022, https://www.nature.com/articles/s41586-021-04301-9
Revisiting: Efficient and targeted COVID-19 border testing via RL
Bastani et al., Nature 2021, https://www.nature.com/articles/s41586-021-04014-z
Revisiting: ChatGPT
(https://openai.com/blog/chatgpt/)
Reinforcement Learning
Learn a policy π(a|s) from data to optimize future expected reward
Optimization, delayed consequences, exploration, generalization
Actions impact data distribution: rewards observed and states reached
Reinforcement Learning: Standard Settings
State dependence
Bandits: next state independent of prior state and action
General decision process: next state depends on prior states and actions
Online/Offline
Offline / batch: Learn from historical data only
Online: Agent / algorithm can actively gather its own data
Reinforcement Learning: Core Ideas
Function approximation + off-policy learning is a key challenge
New policy introduces a new distribution over (s, a, r)
Important because we want data-efficient RL in complex domains
PPO: Control with clipping
DAGGER: mitigate by obtaining more expert labels
Pessimistic Q Learning / CQL / MOPO: introduce pessimism into
offline RL
These align closely with many of the core points of Chelsea Finn's Deep RL course summary slides
Reinforcement Learning: Core Ideas
Function approximation + off-policy learning is a key challenge
New policy introduces a new distribution over (s, a, r)
Important because we want data-efficient RL in complex domains
PPO: Control with clipping
DAGGER: mitigate by obtaining more expert labels
Pessimistic Q Learning / CQL / MOPO: introduce pessimism into
offline RL
Models, values and policies
Models: easier to represent uncertainty (why?), useful for MCTS
Q function: summarizes performance of a policy and implies a policy
Policies: the main target of most RL applications
Computational vs Data Efficiency
Data efficient techniques often very computationally intensive
In some domains, data = computation (e.g. simulated settings)
These align closely with many of the core points of Chelsea Finn's Deep RL course summary slides
Open Challenges
Practical, robust RL
Robust/stable: Need for automatic hyperparameter tuning, model
selection, and generally robust methods for off-the-shelf RL
Efficiency: Need for data and computationally efficient methods
Hybrid offline-online:
Framing the problem
Alternate formulations to Markov decision processes?
Multi-task vs single task?
Alternate forms of feedback?
Stochastic vs adversarial vs cooperative decision process?
Continuous learning + planning vs system identification then planning?
Advancing data-driven decision making in domains that could benefit