Reinforcement Learning in Practice: Q-Learning

Q-Learning

The Q-Learning code is largely the same as the SARSA code; the main difference is that SARSA is on-policy while Q-Learning is off-policy (the two update rules are contrasted right after the agent class below).

import gym
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
matplotlib.style.use('ggplot')
import pandas as pd
import time
from IPython.display import clear_output
env = gym.make('Taxi-v3')  # classic gym API: reset() returns a state id, step() returns (next_state, reward, done, info)
class Q_LearningAgent:
    def __init__(self, env, epsilon=0.1, gamma=0.9, learning_rate=0.1):
        self.epsilon = epsilon
        self.gamma = gamma
        self.learning_rate = learning_rate
        self.action_n = env.action_space.n
        self.q_table = np.zeros((env.observation_space.n, env.action_space.n))
    
    
    def execute_epsilon_greedy_policy(self, state):
        # behavior policy: exploit (greedy w.r.t. the Q table) with probability 1 - epsilon, otherwise explore
        if np.random.uniform() > self.epsilon:
            action = np.argmax(self.q_table[state])
        else:
            action = np.random.randint(self.action_n)
        return action
    
    
    def learn(self, state, action, reward, next_state, done):
        # TD target bootstraps from the greedy (max) action value of the next state;
        # (1.0 - done) drops the bootstrap term at terminal states
        td_target = reward + self.gamma * np.max(self.q_table[next_state]) * (1.0 - done)
        td_error = td_target - self.q_table[state][action]
        self.q_table[state, action] += self.learning_rate * td_error
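For comparison, a SARSA-style learn method would differ only in the TD target: it bootstraps from the action that will actually be taken next rather than from the greedy maximum. The sketch below is a minimal illustration written as a drop-in method for a hypothetical agent with the same q_table, gamma and learning_rate fields as Q_LearningAgent; it is not the code from the SARSA post.

    # hypothetical SARSA update, shown only for contrast -- the single change is the bootstrap term
    def sarsa_learn(self, state, action, reward, next_state, next_action, done):
        td_target = reward + self.gamma * self.q_table[next_state][next_action] * (1.0 - done)
        td_error = td_target - self.q_table[state][action]
        self.q_table[state, action] += self.learning_rate * td_error

Because SARSA evaluates the $\epsilon$-greedy policy it actually follows, it needs next_action; the max in Q-Learning makes that argument unnecessary.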

In the learn method, the target policy of Q-Learning is the greedy policy rather than the $\epsilon$-greedy policy; this is exactly what makes Q-Learning off-policy. That is:

The target policy is the greedy policy:

$$\pi(S_{t+1})=\underset{a'}{\operatorname{argmax}}\, Q(S_{t+1}, a')\tag{5.5.1}$$

The behavior policy is the $\epsilon$-greedy policy, which is used to sample the actions actually taken.

The value update is:

$$Q(S_{t}, A_{t}) \leftarrow Q(S_{t}, A_{t})+\alpha\left[R_{t+1}+\gamma \max_{a} Q(S_{t+1}, a)-Q(S_{t}, A_{t})\right]\tag{5.5.2}$$
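As a quick sanity check of the update rule, here is one hand-computed update with made-up numbers ($\alpha=0.1$, $\gamma=0.9$; the states, actions and reward below are purely illustrative, reusing the numpy import from above):

q_table = np.zeros((500, 6))                    # Taxi-v3 has 500 states and 6 actions
q_table[1] = [0.0, 2.0, 1.0, 0.0, 0.0, 0.0]     # pretend values for the next state
state, action, reward, next_state = 0, 3, -1.0, 1
td_target = reward + 0.9 * np.max(q_table[next_state])    # -1.0 + 0.9 * 2.0 = 0.8
td_error = td_target - q_table[state, action]              # 0.8 - 0.0 = 0.8
q_table[state, action] += 0.1 * td_error                   # Q(0, 3): 0.0 -> 0.08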
The backup diagram is as follows:

[Figure: Q-Learning backup diagram]

The arc in the diagram indicates taking the maximum value over the next state's actions. The procedure is as follows:

[Figure: flow chart of the Q-Learning algorithm]
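Since the flow chart is not reproduced here, the standard tabular Q-Learning loop can be summarized as follows (this is exactly what execute_Q_learning_one_episode below performs for a single episode):

# for each episode:
#     initialize S
#     repeat for each step of the episode:
#         choose A from S using the epsilon-greedy behavior policy
#         take action A, observe R and S'
#         Q(S, A) <- Q(S, A) + alpha * (R + gamma * max_a Q(S', a) - Q(S, A))
#         S <- S'
#     until S is terminal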

Eventually, Q converges to the optimal action-value function.

def execute_Q_learning_one_episode(env, agent, render=False):
    total_rewards, total_steps = 0.0, 0.0
    state = env.reset()
    while True:
        if render:
            env.render()
            clear_output(wait=True)
            time.sleep(0.02)
        action = agent.execute_epsilon_greedy_policy(state)  # behavior policy: re-select an epsilon-greedy action at every step
        next_state, reward, done, _ = env.step(action)
        total_rewards += reward
        total_steps += 1.0
        agent.learn(state, action, reward, next_state, done)  # greedy-target update; next_action is not needed
        if done:
            if render:
                print('END')
                print('total_steps: ', total_steps)
            break
        else:
            state = next_state
    return total_rewards, total_steps

The learn method of Q-Learning does not use next_action, but every step of an episode still needs a fresh (state, action) pair, so action = agent.execute_epsilon_greedy_policy(state) must be placed inside the loop! (For contrast, a SARSA-style loop skeleton is sketched below.)
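A SARSA episode loop typically chooses the first action before the loop and the next action inside it, because its update needs the (S', A') pair that will actually be executed. The skeleton below is a sketch only, assuming an agent whose learn method takes next_action like the sarsa_learn sketch shown earlier; it is not the original SARSA code:

# SARSA-style loop skeleton, for contrast only
state = env.reset()
action = agent.execute_epsilon_greedy_policy(state)              # chosen once before the loop
while True:
    next_state, reward, done, _ = env.step(action)
    next_action = agent.execute_epsilon_greedy_policy(next_state)  # the action that will actually be taken next
    agent.learn(state, action, reward, next_state, next_action, done)
    if done:
        break
    state, action = next_state, next_action                      # reuse next_action on the following step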

episodes = 5000
agent = Q_LearningAgent(env)
results = [execute_Q_learning_one_episode(env, agent) for _ in range(episodes)]
unzipped_results = list(zip(*results))
steps = unzipped_results[1]
rewards = unzipped_results[0]
smoothed_steps = pd.Series(steps).rolling(30, 30).mean()  # rolling mean over a 30-episode window (min_periods=30)
plt.figure(figsize=(15, 12))
plt.title("steps of each episode", fontsize=20, color='r')
plt.xticks(fontsize=20)
plt.yticks(fontsize=20)
plt.plot(smoothed_steps, color='b')
plt.savefig('QL_steps.png')

[Figure: smoothed steps per training episode (QL_steps.png)]

smoothed_rewards = pd.Series(rewards).rolling(30, 30).mean()  # same 30-episode rolling mean
plt.figure(figsize=(15, 12))
plt.title("rewards of each episode", fontsize=20, color='r')
plt.xticks(fontsize=20)
plt.yticks(fontsize=20)
plt.plot(smoothed_rewards, color='b')
plt.savefig('QL_rewards.png')

[Figure: smoothed rewards per training episode (QL_rewards.png)]

As can be seen, the two curves are essentially the same as those obtained with SARSA.

Finally, let's test the learned policy:

# test policy
agent.epsilon = 0.0  # act greedily (no exploration) during evaluation
test_results = [execute_Q_learning_one_episode(env, agent) for _ in range(1000)]
test_unzipped_results = list(zip(*test_results))
steps = test_unzipped_results[1]
rewards = test_unzipped_results[0]
print('average steps per episode: ', sum(steps) / len(steps))
average steps per episode:  12.981
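To watch the greedy policy step by step, the same episode function can be reused with render=True (this assumes a notebook environment, since rendering relies on IPython's clear_output):

execute_Q_learning_one_episode(env, agent, render=True)  # epsilon is already 0.0, so the rollout is fully greedy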

References

《强化学习原理与Python实现》, 肖智清
