Report

Value Iteration & Q-learning
CS 5368, Song Cui

Outline
• Recap
• Value Iteration
• Q-learning

Recap
• "Markov" meanings
• MDP components: states s_t ∈ S; actions a_t ∈ A(s_t); transition model T(s, a, s'); reward function R(s, a, s'); start state s_0; possibly a terminal state
• Solving the MDP

Recap
• Utility of sequences: the discount rate γ determines the "vision" of the agent.
• State-value function V^π(s):
  V^π(s) = E_π[ Σ_{k=0}^∞ γ^k R_{t+k+1} | s_t = s ]
• Action-value function Q^π(s, a):
  Q^π(s, a) = E_π[ Σ_{k=0}^∞ γ^k R_{t+k+1} | s_t = s, a_t = a ]

Value Iteration
• How to find V*_k(s) as k → ∞:
  Almost a solution: recursion.
  Correct solution: dynamic programming.

Value Iteration
• Bellman update:
  V*_{k+1}(s) = max_a q*_{k+1}(s, a)
  q*_{k+1}(s, a) = Σ_{s'} T(s, a, s') [ R(s, a, s') + γ V*_k(s') ]
• Another way to see it: alternate layers of V-nodes (max over actions) and Q-nodes (expectation over outcomes).

Value Iteration
• Algorithm:
  ∀s: V*_0(s) = 0
  for i = 1, 2, 3, ...:
    for s ∈ S, a ∈ A:
      q*_i(s, a) = Σ_{s'} T(s, a, s') [ R(s, a, s') + γ V*_{i-1}(s') ]
    for s ∈ S:
      V*_i(s) = max_a [ q*_i(s, a) ]
      π*_i(s) = argmax_a [ q*_i(s, a) ]

Value Iteration
• Theorem: the values converge to the optimal values; the Bellman update is a γ-contraction in the max norm:
  ||U_{t+1} − U|| ≤ γ ||U_t − U||
  and if ||U_{t+1} − U_t|| < ε(1 − γ)/γ, then ||U_{t+1} − U|| < ε.
• The policy may converge (long) before the values do.

Value Iteration
• Advantage compared with Expectimax. Given an MDP with state space {S1, S2}, actions {1, 2}, transitions that keep the agent in its current state with probability 80% (and move it with probability 20%), and reward 1 for reaching S1 and 0 for reaching S2:
  [Figure: expectimax search tree rooted at S1. The same states S1 and S2 reappear at every depth ("Repeats!"), so expectimax re-solves them over and over, while value iteration computes each state's value once per sweep.]

Q-learning
• Compared with Value Iteration:
  same: an underlying MDP, and the goal of finding a policy
  different: T(s, a, s') and R(s, a, s') are unknown
  different: the way the MDP is solved (learned model vs. unlearned model)
• Reinforcement Learning: policy, experience, reward
  model-based vs. model-free
  passive learning vs.
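The value-iteration algorithm above can be sketched in a few lines of Python. The two-state MDP here is hypothetical, loosely modeled on the slide's example: every action keeps the agent in its current state with probability 0.8, and arriving in state s1 yields reward 1.

```python
# Hypothetical two-state, two-action MDP (illustrative only).
S = ["s1", "s2"]
A = [1, 2]
GAMMA = 0.9                     # discount rate

def T(s, a, s2):
    """Transition model: stay in the current state with prob. 0.8."""
    return 0.8 if s2 == s else 0.2

def R(s, a, s2):
    """Reward model: reward 1 for arriving in s1, 0 otherwise."""
    return 1.0 if s2 == "s1" else 0.0

def value_iteration(eps=1e-6):
    V = {s: 0.0 for s in S}     # V*_0(s) = 0
    while True:
        # q*_i(s,a) = sum_{s'} T(s,a,s') [ R(s,a,s') + gamma V*_{i-1}(s') ]
        Q = {(s, a): sum(T(s, a, s2) * (R(s, a, s2) + GAMMA * V[s2])
                         for s2 in S)
             for s in S for a in A}
        # V*_i(s) = max_a q*_i(s,a)
        V_new = {s: max(Q[(s, a)] for a in A) for s in S}
        if max(abs(V_new[s] - V[s]) for s in S) < eps:
            # pi*_i(s) = argmax_a q*_i(s,a)
            policy = {s: max(A, key=lambda a: Q[(s, a)]) for s in S}
            return V_new, policy
        V = V_new
```

Because γ < 1, each sweep is a contraction, so the loop terminates; for this MDP the fixed point satisfies V(s1) + V(s2) = 1/(1 − γ) = 10.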
active learning

Q-learning
• Value iteration update (for comparison):
  V_{k+1}(s) = max_a Σ_{s'} T(s, a, s') [ R(s, a, s') + γ V_k(s') ]
• Q-value iteration:
  Q_{k+1}(s, a) = Σ_{s'} T(s, a, s') [ R(s, a, s') + γ max_{a'} Q_k(s', a') ]

Q-learning
• Q-learning: Q-value iteration, driven by samples instead of the (unknown) model
• Process:
  sample a transition: s → a → s', r
  update the Q-value toward the sample:
  Q(s, a) ← (1 − α) Q(s, a) + α [ r + γ max_{a'} Q(s', a') ]

Q-learning
• Q-learning converges to the optimal policy, provided the agent samples enough and the learning rate is made small enough.
• Ways to explore: epsilon-greedy action selection, i.e. choose between acting randomly (with probability ε) and acting according to the best current Q-value.
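The sample-and-update loop, together with epsilon-greedy exploration, can be sketched as below. The environment is the same hypothetical two-state MDP used earlier, but here the agent never reads T or R directly; it only samples transitions.

```python
import random

random.seed(0)                  # reproducible illustrative run

S = ["s1", "s2"]
A = [1, 2]
GAMMA, ALPHA, EPSILON = 0.9, 0.1, 0.1   # discount, learning rate, exploration

def step(s, a):
    """Environment the agent can only sample: stay put with prob. 0.8;
    landing in s1 yields reward 1, landing in s2 yields 0."""
    s2 = s if random.random() < 0.8 else ("s2" if s == "s1" else "s1")
    return s2, (1.0 if s2 == "s1" else 0.0)

Q = {(s, a): 0.0 for s in S for a in A}

def choose_action(s):
    """Epsilon-greedy: act randomly with prob. EPSILON, else act
    according to the best current Q-value."""
    if random.random() < EPSILON:
        return random.choice(A)
    return max(A, key=lambda a: Q[(s, a)])

s = "s1"
for _ in range(20000):
    a = choose_action(s)
    s2, r = step(s, a)                                    # sample: s -a-> s', r
    target = r + GAMMA * max(Q[(s2, a2)] for a2 in A)     # sampled backup
    Q[(s, a)] = (1 - ALPHA) * Q[(s, a)] + ALPHA * target  # move toward sample
    s = s2
```

With a fixed learning rate the Q-values keep fluctuating around the optimum rather than converging exactly; decaying ALPHA over time (as the convergence condition above requires) removes that residual noise.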