Summary of *Reinforcement Learning: An Introduction*, Chapter 3: "Finite Markov Decision Processes"

This post is a study summary of Chapter 3 of *Reinforcement Learning: An Introduction*, on finite Markov decision processes. It focuses on choosing suitable state and action representations, on the importance of reward design in RL, and on the use of discounted return in continuing and episodic tasks; it also touches on the Markov property, MDP transition dynamics, value functions, and the Bellman equations.


A new colleague has joined the group and I need to help them get started with RL, so we are starting from Silver's course.

For myself, I have added the requirement of reading *Reinforcement Learning: An Introduction* carefully.

My previous reading was not very thorough, so this time I hope to be more careful and write a short summary of the corresponding knowledge points.



When applying RL to real problems, the existing algorithms are, on the whole, good enough; the main work is designing states and rewards that capture the essence of the problem (the action space is usually obvious):

Of course, the particular states and actions vary greatly from task to task, and how they are represented can strongly affect performance. In reinforcement learning, as in other kinds of learning, such representational choices are at present more art than science. 

State representation: it should satisfy the Markov property as closely as possible; the current state should summarize the useful information in the history (in general it cannot be the immediate sensations alone, nor does it need to be the complete history of all past sensations);
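For reference, the Markov property asks that the next state and reward depend only on the current state and action, not on the rest of the history; restated here (not quoted verbatim) in the book's notation:

$$
\Pr\{S_{t+1}=s',\,R_{t+1}=r \mid S_t,A_t\}
=\Pr\{S_{t+1}=s',\,R_{t+1}=r \mid S_0,A_0,R_1,\dots,R_t,S_t,A_t\}
$$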

Action: task specific; the granularity must be appropriate.

Reward design: it must reflect our goal. Typically, give a positive reward for actions that achieve the goal and a negative reward for actions we do not want to see ("To encourage smooth movements, on each time step a small, negative reward can be given as a function of the moment-to-moment 'jerkiness' of the motion");
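For the discussion of returns below, recall the (discounted) return from the chapter (restated, not quoted):

$$
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}, \qquad 0 \le \gamma \le 1,
$$

with the sum truncated at the terminal time $T$ for episodic tasks; $\gamma = 1$ (allowed only in the episodic case) gives the undiscounted return used in the maze example below.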

Reward design must reflect the goal. Exercise 3.5 is a good example: with a reward of +1 for escaping from the maze, a reward of zero at all other times, and an undiscounted return G_t, the return is exactly the same no matter how many time steps the escape takes. The agent then has no signal about what we actually want it to learn: it can wander around inside as long as it eventually gets out. For a maze task, a small negative reward should instead be given at every time step, so the agent learns to leave the maze as quickly as possible in order to obtain a larger return (see the sketch below). Using discounting would also work, though the effect is probably less pronounced.
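A minimal numerical sketch of that comparison; the episode lengths and reward values are made up purely for illustration:

```python
# Exercise 3.5's point: with an undiscounted return and a reward of +1 only
# on escape, every successful episode has the same return regardless of its
# length; a -1 per-step penalty makes shorter escapes strictly better.

def undiscounted_return(rewards):
    """Sum of rewards, i.e. G_t with gamma = 1."""
    return sum(rewards)

def episode(steps, per_step_reward, terminal_reward):
    """Reward sequence for an episode that escapes on its final transition."""
    return [per_step_reward] * (steps - 1) + [terminal_reward]

for steps in (5, 50, 500):
    sparse = episode(steps, per_step_reward=0.0, terminal_reward=1.0)   # +1 only on escape
    dense = episode(steps, per_step_reward=-1.0, terminal_reward=-1.0)  # -1 on every step
    print(f"{steps:3d} steps: +1-at-exit return = {undiscounted_return(sparse):4.1f}, "
          f"-1-per-step return = {undiscounted_return(dense):7.1f}")
```

With the +1-at-exit scheme every escape scores 1.0, so a 500-step escape looks exactly as good as a 5-step one; with the per-step penalty the return is minus the number of steps, so faster escapes strictly dominate.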

Should reward design worry about absolute magnitudes or only relative differences? Exercises 3.9/3.10 are a good example. For a continuing RL task, only the relative differences between rewards matter (it is fine if they are all negative): adding a constant c to all the rewards adds a constant, $v_c = c\sum_{k=0}^{\infty}\gamma^k = \frac{c}{1-\gamma}$, to the values of all states, and thus does not affect the relative values of any states under any policies. For an episodic RL task, both the relative differences and the absolute magnitudes matter: the offset becomes $v_c = c\sum_{k=0}^{K}\gamma^k$, where K depends on how many steps remain until termination, so states visited later in an episode accumulate fewer $\gamma^k$ terms. The derivation below makes this explicit.
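A short derivation of the constant offset, using the return defined above:

$$
\sum_{k=0}^{\infty}\gamma^k \left(R_{t+k+1} + c\right)
= \sum_{k=0}^{\infty}\gamma^k R_{t+k+1} + c\sum_{k=0}^{\infty}\gamma^k
= G_t + \frac{c}{1-\gamma}
$$

In the continuing case every state value is therefore shifted by the same $v_c = c/(1-\gamma)$, and the ordering of states and policies is unchanged. In the episodic case the sum stops at the terminal time $T$, so the offset $c\sum_{k=0}^{T-t-1}\gamma^k$ is larger for states far from termination than for states near it; adding c can then change which policy looks best (for example, a large positive c can make lingering in a maze look attractive).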



The reward hypothesis: the most fundamental starting point of RL

    What we want / what we mean by goals can be well thought of as the maximization of the expected cumulative sum of a scalar reward signal.

    Maximizing the reward signal is one of the most distinctive features of RL.

    This might at first appear limiting, but in practice it has proved to be flexible and widely applicable, for example, ...

    The reward hypothesis is the most fundamental starting point of RL: the reward you design must make maximizing the reward correspond to maximizing our goal!
