Q-Learning
The code for Q-Learning is very similar to that of SARSA; the biggest difference is that SARSA is on-policy while Q-Learning is off-policy.
import gym
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
matplotlib.style.use('ggplot')
import pandas as pd
import time
from IPython.display import clear_output
env = gym.make('Taxi-v3')
class Q_LearningAgent:
    def __init__(self, env, epsilon=0.1, gamma=0.9, learning_rate=0.1):
        self.epsilon = epsilon
        self.gamma = gamma
        self.learning_rate = learning_rate
        self.action_n = env.action_space.n
        self.q_table = np.zeros((env.observation_space.n, env.action_space.n))

    def execute_epsilon_greedy_policy(self, state):
        # behavior policy: exploit (greedy action) with probability 1 - epsilon, otherwise explore
        if np.random.uniform() > self.epsilon:
            action = np.argmax(self.q_table[state])
        else:
            action = np.random.randint(self.action_n)
        return action

    def learn(self, state, action, reward, next_state, done):
        # TD target bootstraps from the greedy target policy: max_a Q(next_state, a);
        # the (1.0 - done) factor drops the bootstrap term at terminal states
        td_target = reward + self.gamma * np.max(self.q_table[next_state]) * (1.0 - done)
        td_error = td_target - self.q_table[state, action]
        self.q_table[state, action] += self.learning_rate * td_error
In the learn method, the target policy of Q-learning is the greedy policy rather than the ε-greedy policy; this is exactly what makes Q-Learning off-policy. That is:
The target policy is the greedy policy:

$$\pi\left(S_{t+1}\right)=\underset{a^{\prime}}{\operatorname{argmax}}\, Q\left(S_{t+1}, a^{\prime}\right)\tag{5.5.1}$$
The behavior policy is the ε-greedy policy, which is used to sample the actions actually taken.
The action values are updated as follows:

$$Q\left(S_{t}, A_{t}\right) \leftarrow Q\left(S_{t}, A_{t}\right)+\alpha\left[R_{t+1}+\gamma \max_{a} Q\left(S_{t+1}, a\right)-Q\left(S_{t}, A_{t}\right)\right]\tag{5.5.2}$$
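To make the off-policy distinction concrete, the sketch below contrasts the two TD targets. The SARSA-style helper is hypothetical (it is not part of the code in this post) and only assumes the same q_table layout as above:

# SARSA (on-policy): bootstraps from the action actually chosen by the behavior policy.
def sarsa_td_target(q_table, reward, next_state, next_action, gamma, done):
    return reward + gamma * q_table[next_state, next_action] * (1.0 - done)

# Q-learning (off-policy): bootstraps from the greedy target policy and ignores next_action.
def q_learning_td_target(q_table, reward, next_state, gamma, done):
    return reward + gamma * np.max(q_table[next_state]) * (1.0 - done)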
In the backup diagram for Q-learning, the arc over the successor actions indicates that the maximum is taken. The procedure is as follows: initialize Q arbitrarily; in each episode, start from the initial state and repeatedly choose an action with the ε-greedy behavior policy, take it, observe the reward and the next state, apply update (5.5.2), and move to the next state until the episode terminates. Eventually Q converges to the optimal action-value function.
def execute_Q_learning_one_episode(env, agent, render=False):
    total_rewards, total_steps = 0.0, 0.0
    state = env.reset()
    while True:
        if render:
            env.render()
            clear_output(wait=True)
            time.sleep(0.02)
        action = agent.execute_epsilon_greedy_policy(state)  # continually update (s, a) tuple!
        next_state, reward, done, _ = env.step(action)
        total_rewards += reward
        total_steps += 1.0
        agent.learn(state, action, reward, next_state, done)
        if done:
            if render:
                print('END')
                print('total_steps: ', total_steps)
            break
        else:
            state = next_state
    return total_rewards, total_steps
The learn method of Q-Learning does not use next_action, but every (state, action) pair along the trajectory still has to be visited in each episode, so action = agent.execute_epsilon_greedy_policy(state) must be written inside the loop! A comparison with the SARSA loop structure is sketched below.
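For contrast, a SARSA-style episode loop (a hypothetical sketch; it assumes a sarsa_agent whose learn method takes next_action, which does not exist in this post) selects next_action before updating and carries it over to the next step, while the Q-learning loop above simply re-samples the action at the top of each iteration:

# SARSA-style loop (sketch): next_action is chosen before learn() and reused in the next step.
state = env.reset()
action = sarsa_agent.execute_epsilon_greedy_policy(state)
while True:
    next_state, reward, done, _ = env.step(action)
    next_action = sarsa_agent.execute_epsilon_greedy_policy(next_state)
    sarsa_agent.learn(state, action, reward, next_state, next_action, done)
    if done:
        break
    state, action = next_state, next_action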
episodes = 5000
agent = Q_LearningAgent(env)
results = [execute_Q_learning_one_episode(env, agent) for _ in range(episodes)]
unzipped_results = list(zip(*results))
steps = unzipped_results[1]
rewards = unzipped_results[0]
smoothed_steps = pd.Series(steps).rolling(30, 30).mean()
plt.figure(figsize=(15, 12))
plt.title("steps of each episode", fontsize=20, color='r')
plt.xticks(fontsize=20)
plt.yticks(fontsize=20)
plt.plot(smoothed_steps, color='b')
plt.savefig('QL_steps.png')
smoothed_rewards = pd.Series(rewards).rolling(30, 30).mean()
plt.figure(figsize=(15, 12))
plt.title("steps of each episode", fontsize=20, color='r')
plt.xticks(fontsize=20)
plt.yticks(fontsize=20)
plt.plot(smoothed_rewards, color='b')
plt.savefig('QL_rewards.png')
As the plots show, the two curves are essentially the same as those obtained with the SARSA algorithm.
Finally, let's test the learned policy:
# test policy
agent.epsilon = 0.0
test_results = [execute_Q_learning_one_episode(env, agent) for _ in range(1000)]
test_unzipped_results = list(zip(*test_results))
steps = test_unzipped_results[1]
rewards = test_unzipped_results[0]
print('average steps per episode: ', sum(steps) / len(steps))
average steps per episode: 12.981
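Since the episode returns were also collected, the average return per episode can be reported as well (a small extra check, not part of the original output):

# average episode return over the 1000 greedy test episodes
print('average rewards per episode: ', sum(rewards) / len(rewards))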
References
《强化学习原理与Python实现》, 肖智清