
CS 747, Autumn 2023: Lecture 3

Shivaram Kalyanakrishnan

Department of Computer Science and Engineering


Indian Institute of Technology Bombay

Autumn 2023


Shivaram Kalyanakrishnan (2023) CS 747, Autumn 2023 1 / 10


Multi-armed Bandits
The exploration-exploitation dilemma
Definitions: Bandit, Algorithm
ϵ-greedy algorithms
Evaluating algorithms: Regret
Achieving sub-linear regret
A lower bound on regret
UCB, KL-UCB algorithms
Thompson Sampling algorithm
Concentration bounds
Analysis of UCB
Understanding Thompson Sampling
Other bandit problems
A Lower Bound on Regret
Paraphrasing Lai and Robbins (1985; see Theorem 2).
Let L be an algorithm such that for every bandit instance I ∈ Ī and for every α > 0, as T → ∞: R_T(L, I) = o(T^α).

Then, for every bandit instance I ∈ Ī, as T → ∞:

R_T(L, I) / ln(T) ≥ Σ_{a : p_a(I) ≠ p⋆(I)} (p⋆(I) − p_a(I)) / KL(p_a(I), p⋆(I)),

where for x, y ∈ [0, 1), KL(x, y) := x ln(x/y) + (1 − x) ln((1 − x)/(1 − y)).
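The coefficient of ln(T) in this lower bound is easy to evaluate numerically for a concrete Bernoulli instance. A minimal sketch (the three arm means are hypothetical, chosen only for illustration):

```python
import math

def kl(x, y):
    # Bernoulli KL divergence KL(x, y), exactly as defined on the slide.
    out = 0.0
    if x > 0:
        out += x * math.log(x / y)
    if x < 1:
        out += (1 - x) * math.log((1 - x) / (1 - y))
    return out

def lai_robbins_constant(means):
    # Coefficient of ln(T) in the regret lower bound: sum over
    # suboptimal arms of (p* - p_a) / KL(p_a, p*).
    p_star = max(means)
    return sum((p_star - p) / kl(p, p_star) for p in means if p != p_star)

print(lai_robbins_constant([0.9, 0.8, 0.5]))
```

Arms closer to the optimal mean contribute more to the constant, since the gap shrinks linearly while the KL term shrinks quadratically.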


Multi-armed Bandits

1. UCB, KL-UCB algorithms

2. Thompson Sampling algorithm



Upper Confidence Bounds (UCB; Auer et al., 2002)
- At time t, for every arm a, define ucb_a^t = p̂_a^t + √(2 ln(t) / u_a^t).
- p̂_a^t is the empirical mean of rewards from arm a.
- u_a^t is the number of times a has been sampled at time t.
- Pull an arm a for which ucb_a^t is maximum.
- Achieves regret of O(log(T)): optimal dependence on T up to a constant factor.

[Figure: for each arm a, the upper confidence bound ucb_a^t lies above the empirical mean p̂_a^t.]
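The rule above can be sketched as a short simulation. This is a minimal illustration, not the lecture's code; the Bernoulli arm means [0.9, 0.5] and the horizon are hypothetical:

```python
import math
import random

def ucb_run(means, horizon, seed=0):
    # Simulate UCB (Auer et al., 2002) on a Bernoulli bandit with the
    # given true means; returns the number of pulls of each arm.
    rng = random.Random(seed)
    n = len(means)
    pulls = [0] * n        # u_a^t: times arm a has been sampled
    successes = [0] * n    # total reward obtained from arm a
    for t in range(1, horizon + 1):
        if t <= n:
            a = t - 1      # pull each arm once to initialise the estimates
        else:
            # ucb_a^t = empirical mean + sqrt(2 ln(t) / u_a^t)
            a = max(range(n), key=lambda i: successes[i] / pulls[i]
                    + math.sqrt(2 * math.log(t) / pulls[i]))
        successes[a] += 1 if rng.random() < means[a] else 0
        pulls[a] += 1
    return pulls

pulls = ucb_run([0.9, 0.5], horizon=2000)
```

Over a long horizon the suboptimal arm's pull count grows only logarithmically, which is where the O(log(T)) regret comes from.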
KL-UCB (Garivier and Cappé, 2011)
Identical to the UCB algorithm on the previous slide, except for a different definition of the upper confidence bound.

ucb-kl_a^t = max{q ∈ [p̂_a^t, 1] s.t. u_a^t · KL(p̂_a^t, q) ≤ ln(t) + c ln(ln(t))}, where c ≥ 3.
Equivalently, ucb-kl_a^t is the solution q ∈ [p̂_a^t, 1] to KL(p̂_a^t, q) = (ln(t) + c ln(ln(t))) / u_a^t.

KL-UCB algorithm: at step t, pull argmax_{a ∈ A} ucb-kl_a^t.

Observe that KL(p̂_a^t, q) monotonically increases with q, and
▶ KL(p̂_a^t, p̂_a^t) = 0;
▶ KL(p̂_a^t, 1) = ∞.
It is easy to compute ucb-kl_a^t numerically (for example, through binary search).

ucb-kl_a^t is a tighter confidence bound than ucb_a^t.

The regret of KL-UCB asymptotically matches Lai and Robbins' lower bound!
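The binary search suggested on the slide exploits exactly the monotonicity noted above: KL(p̂, q) increases from 0 toward ∞ as q grows from p̂ toward 1. A sketch (the inputs p_hat = 0.6, u = 50, t = 1000 are hypothetical):

```python
import math

def bern_kl(x, y):
    # Bernoulli KL(x, y), as defined earlier in the lecture; y is
    # clamped away from 0 and 1 to keep the logs finite.
    eps = 1e-12
    y = min(max(y, eps), 1 - eps)
    out = 0.0
    if x > 0:
        out += x * math.log(x / y)
    if x < 1:
        out += (1 - x) * math.log((1 - x) / (1 - y))
    return out

def ucb_kl(p_hat, u, t, c=3, tol=1e-6):
    # Largest q in [p_hat, 1] with u * KL(p_hat, q) <= ln(t) + c ln(ln(t)).
    # Since KL(p_hat, q) is increasing in q, binary search applies.
    # Assumes t >= 3 so that ln(ln(t)) is defined.
    rhs = (math.log(t) + c * math.log(math.log(t))) / u
    lo, hi = p_hat, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if bern_kl(p_hat, mid) <= rhs:
            lo = mid
        else:
            hi = mid
    return lo

q = ucb_kl(p_hat=0.6, u=50, t=1000)
```

Each halving of the interval costs one KL evaluation, so the bound is computed to tolerance tol in O(log(1/tol)) steps.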


Multi-armed Bandits

1. UCB, KL-UCB algorithms

2. Thompson Sampling algorithm



Background: Beta Distribution
Beta(α, β) is defined on [0, 1]. Two parameters: α and β.
Mean = α / (α + β); Variance = αβ / ((α + β)² (α + β + 1)).

[Figure: Beta pdfs for (α, β) = (1, 1), (3, 4), (5, 15) on [0, 1], alongside Gaussian pdfs for (µ, σ) = (0, 2), (0, 3), (5, 1) for comparison.]


Thompson Sampling (Thompson, 1933)
- At time t, let arm a have s_a^t successes (1's/heads) and f_a^t failures (0's/tails).
- Beta(s_a^t + 1, f_a^t + 1) represents a "belief" about the true mean of arm a.
- Mean = (s_a^t + 1) / (s_a^t + f_a^t + 2); variance = (s_a^t + 1)(f_a^t + 1) / ((s_a^t + f_a^t + 2)² (s_a^t + f_a^t + 3)).
- Computational step: for every arm a, draw a sample (in the agent's mind) x_a^t ∼ Beta(s_a^t + 1, f_a^t + 1).
- Sampling step: pull (in the real world) the arm a for which x_a^t is maximum.
- Achieves optimal regret (Kaufmann et al., 2012); is excellent in practice (Chapelle and Li, 2011).
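The two steps above can be sketched directly with the standard library's Beta sampler. A minimal illustration, not the lecture's code; the arm means [0.9, 0.5] and horizon are hypothetical:

```python
import random

def thompson_run(means, horizon, seed=0):
    # Simulate Thompson Sampling on a Bernoulli bandit with the given
    # true means; returns the number of pulls of each arm.
    rng = random.Random(seed)
    n = len(means)
    s = [0] * n        # s_a^t: successes observed on arm a
    f = [0] * n        # f_a^t: failures observed on arm a
    pulls = [0] * n
    for _ in range(horizon):
        # Computational step: sample x_a ~ Beta(s_a + 1, f_a + 1) per arm.
        x = [rng.betavariate(s[a] + 1, f[a] + 1) for a in range(n)]
        # Sampling step: pull the arm whose sample is largest.
        a = max(range(n), key=lambda i: x[i])
        if rng.random() < means[a]:
            s[a] += 1
        else:
            f[a] += 1
        pulls[a] += 1
    return pulls

pulls = thompson_run([0.9, 0.5], horizon=2000)
```

As evidence accumulates, each arm's posterior narrows; samples from suboptimal arms rarely exceed those from the best arm, so exploration tapers off automatically.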


Multi-armed Bandits
The exploration-exploitation dilemma
Definitions: Bandit, Algorithm
ϵ-greedy algorithms
Evaluating algorithms: Regret
Achieving sub-linear regret
A lower bound on regret
UCB, KL-UCB algorithms
Thompson Sampling algorithm
Concentration bounds
Analysis of UCB
Understanding Thompson Sampling
Other bandit problems
