### Slide

COSC 878
Seminar on Large Scale Statistical Machine Learning
Today’s Plan
• Course Website: http://people.cs.georgetown.edu/~huiyang/cosc878/
• Student Introductions
• Team-up and Presentation Scheduling
• First Talk
Reinforcement Learning:
A Survey
Grace
1/13/15
What is Reinforcement Learning?
• The problem faced by an agent that must
learn behavior through trial-and-error
interactions with a dynamic environment
Solve RL Problems – Two strategies
• Search in the behavior space
– To find a behavior that performs well in the environment
– Genetic algorithms, genetic programming
• Statistical methods and dynamic programming
– Estimate the utility of taking actions in states of
the world
– We focus on this strategy
Standard RL model
What we learn in RL
• The agent’s job is to find a policy π that maximizes some long-run measure of reinforcement.
– A policy π maps states to actions
– Reinforcement = reward
Difference between RL and Supervised
Learning
• In RL, there is no presentation of input/output pairs
– No training data
– We only know the immediate reward
– We do not know the best actions in the long run
• In RL, we need to evaluate the system online while learning
– Online evaluation (knowing the online performance) is important
Difference between RL and
AI/Planning
• AI planning algorithms are less general
– They require a predefined model of state transitions
– And they assume determinism
• RL assumes that the state space can be enumerated and stored in memory
Models
• The difficult part:
– How to incorporate the future into the model
• Three models
– Finite horizon
– Infinite horizon
– Average-reward
Finite Horizon
• At a given moment in time, the agent
optimizes its expected reward for the next h
steps
• Ignore what will happen after h steps
Infinite Horizon
• Maximize the long-run reward
• Does not put a limit on the number of future steps
• Future rewards are discounted geometrically by a discount factor γ (between 0 and 1)
• Mathematically more tractable than the finite horizon
Average-reward
• Maximize the long-run average reward
• It is the limiting case of the infinite horizon as γ approaches 1
• Weakness:
– It cannot distinguish when the large rewards are received
– If we prefer a large initial reward, we have no way to express that preference in this model
• Cures:
– Maximize both the long-run average and the initial rewards
– The bias-optimal model
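Written out, the three criteria compared in the next slides are, in their standard formulations (with r_t denoting the reward received at step t):

Finite horizon: E[ Σ_{t=0..h} r_t ]
Infinite horizon (discounted): E[ Σ_{t=0..∞} γ^t r_t ], with 0 ≤ γ < 1
Average reward: lim_{h→∞} E[ (1/h) Σ_{t=0..h} r_t ]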
Compare model optimality
(Example figure: three reward sequences, referred to below as the upper, middle, and lower lines; all unlabeled arrows produce a reward of 0; a single action.)
Compare model optimality
Finite horizon, h = 4
• Upper line: 0+0+2+2+2 = 6
• Middle line: 0+0+0+0+0 = 0
• Lower line: 0+0+0+0+0 = 0
Compare model optimality
Infinite horizon, γ = 0.9
• Upper line: 0*0.9^0 + 0*0.9^1 + 2*0.9^2 + 2*0.9^3 + 2*0.9^4 + … = 2*0.9^2*(1 + 0.9 + 0.9^2 + …) = 1.62 * 1/(1 − 0.9) = 16.2
• Middle line: … + 10*0.9^5 + … ≈ 59
• Lower line: … + 11*0.9^6 + … ≈ 58.5
Compare model optimality
Average reward
• Upper line: ≈ 2
• Middle line: ≈ 10
• Lower line: ≈ 11
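A quick way to sanity-check these numbers is to compute all three criteria on the three reward streams directly. The sketch below assumes the reward timings implied by the sums above (the +2, +10, and +11 per-step payoffs begin at steps 2, 5, and 6 respectively); these timings are read off the formulas, not the original figure.

```python
# Minimal sketch (not from the slides) reproducing the optimality comparison.

def rewards(first_step, value, horizon):
    """Reward stream that pays `value` every step from `first_step` onward."""
    return [value if t >= first_step else 0 for t in range(horizon)]

def finite_horizon(r, h):
    return sum(r[:h + 1])                      # undiscounted sum of rewards at steps 0..h

def discounted(r, gamma):
    return sum(gamma ** t * x for t, x in enumerate(r))

def average_reward(r):
    return sum(r) / len(r)                     # long-run average over a long truncation

T = 2000                                       # truncation long enough to approximate the limits
paths = {"upper": rewards(2, 2, T), "middle": rewards(5, 10, T), "lower": rewards(6, 11, T)}

for name, r in paths.items():
    print(name,
          "finite h=4:", finite_horizon(r, 4),
          "discounted γ=0.9: %.1f" % discounted(r, 0.9),
          "average: %.1f" % average_reward(r))
```

Running it reproduces the rankings above: the upper line wins under the finite-horizon criterion, the middle line under discounting with γ = 0.9, and the lower line under average reward.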
Parameters
• Finite horizon and infinite horizon both have
parameters
– h
– γ
• These parameters matter to the choice of
optimality model
– Choose them carefully in your application
• Average reward model’s advantage: not
influenced by those parameters
MARKOV MODELS
Markov Process
• Markov Property¹ (the “memoryless” property): for a system, its next state depends only on its current state.
Pr(Si+1 | Si, …, S0) = Pr(Si+1 | Si)
(Figure: a chain of states s0 → s1 → … → si → si+1 → …)
• Markov Process: a stochastic process with the Markov property.
¹ A. A. Markov, ‘06
Family of Markov Models
• Markov Chain
• Hidden Markov Model
• Markov Decision Process
• Partially Observable Markov Decision Process
• Multi-armed Bandit
Markov Chain
(S, M)
• A discrete-time Markov process
– State S – a web page
– Transition probability M
• PageRank¹: how likely a random web surfer will land on a page
(Figure: a web graph over pages A, B, C, D, E, each annotated with its PageRank value)

PageRank(u) = λ/N + (1 − λ) · Σ_{v ∈ In(u)} PageRank(v)/|Out(v)|

where λ is the random jump factor, N is the # of pages, In(u) is the set of pages linking to u, and |Out(v)| is the number of outgoing links of v.

• The stationary (stable-state) distribution of such a Markov chain is PageRank.
¹ L. Page et al., ‘99
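Since PageRank is the stationary distribution of this chain, it can be computed by power iteration. A minimal sketch, assuming a made-up five-page link structure (the actual A–E links in the figure are not specified):

```python
import numpy as np

# Toy link structure (made up for illustration): page -> pages it links to
links = {0: [1, 2], 1: [2], 2: [0, 3], 3: [2, 4], 4: [0]}
N, lam = len(links), 0.15                     # lam = random jump factor

# Row-stochastic link matrix: M[u, v] = probability of following a link from u to v
M = np.zeros((N, N))
for u, outs in links.items():
    M[u, outs] = 1.0 / len(outs)

pr = np.full(N, 1.0 / N)                      # start from the uniform distribution
for _ in range(100):                          # power iteration toward the stationary distribution
    pr = lam / N + (1 - lam) * M.T @ pr

print(pr, pr.sum())                           # PageRank scores (sum to ~1)
```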
Hidden Markov Model (S, M, O, e)
• A Markov chain whose states are hidden and which emits observable symbols with some probability depending on its state¹.
(Figure: hidden states s0 → s1 → s2 → … with transition probabilities p0, p1, p2, …, each state si emitting an observation oi)
– si – hidden state
– pi – transition probability
– oi – observation
– ei – observation probability (emission probability)
¹ Leonard E. Baum et al., ‘66
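A tiny generative sketch can make the legend concrete: at each step the chain emits a symbol according to its current hidden state and then transitions. The two-state transition matrix M and emission matrix e below are made-up numbers, not from the slide.

```python
import numpy as np

rng = np.random.default_rng(0)
M = np.array([[0.7, 0.3],        # M[i, j] = P(next hidden state j | current hidden state i)
              [0.4, 0.6]])
e = np.array([[0.9, 0.1],        # e[i, k] = P(observation k | hidden state i)
              [0.2, 0.8]])

s, states, obs = 0, [], []
for _ in range(10):
    states.append(s)
    obs.append(rng.choice(2, p=e[s]))   # emit a symbol according to the current hidden state
    s = rng.choice(2, p=M[s])           # then move to the next hidden state
print(states, obs)                      # hidden states are what we never see; obs is what we see
```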
Markov Decision Process (S, M, A, R, γ)
• MDP extends MC with actions and rewards¹
(Figure: a trajectory s0 → s1 → s2 → s3 → … where taking action ai in state si yields reward ri and moves to the next state with transition probability pi)
– si – state
– ai – action
– ri – reward
– pi – transition probability
¹ R. Bellman, ‘57
Definition of MDP
• A tuple (S, M, A, R, γ)
– S: state space
– M: transition matrix, Ma(s, s') = P(s' | s, a)
– A: action space
– R: reward function, R(s, a) = immediate reward for taking action a at state s
– γ: discount factor, 0 < γ ≤ 1
• Policy π
– π(s) = the action taken at state s
• Goal: find an optimal policy π* maximizing the expected total reward.
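The tuple can be written down directly in code. A minimal sketch with a made-up two-state, two-action MDP (the field names and numbers are illustrative, not from the slides):

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class MDP:
    n_states: int
    n_actions: int
    M: np.ndarray      # M[a, s, s'] = P(s' | s, a)
    R: np.ndarray      # R[s, a]     = immediate reward for taking action a in state s
    gamma: float       # discount factor, 0 < gamma <= 1

# Toy 2-state, 2-action MDP (numbers made up)
toy = MDP(
    n_states=2,
    n_actions=2,
    M=np.array([[[0.9, 0.1], [0.2, 0.8]],     # transitions under action 0
                [[0.5, 0.5], [0.1, 0.9]]]),   # transitions under action 1
    R=np.array([[0.0, 1.0],                   # rewards in state 0 for actions 0, 1
                [2.0, 0.0]]),                 # rewards in state 1 for actions 0, 1
    gamma=0.9,
)
```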
Policy
• Policy: π(s) = a, the rule according to which we select an action a at state s
(Figure: a grid example where π(s0) = move right and up, π(s1) = move right and up, π(s2) = move right)
[Slide altered from Carlos Guestrin’s ML lecture]
Value of Policy
• Value Vπ(s): the expected long-term reward starting from s and following π
Vπ(s0) = E[R(s0) + γ R(s1) + γ^2 R(s2) + γ^3 R(s3) + γ^4 R(s4) + …]
• Future rewards are discounted by γ ∈ [0, 1)
(Figure, built up over three slides: starting from s0 with reward R(s0), taking the policy action π(s0) leads to one of the next states s1, s1', or s1'' with rewards R(s1), R(s1'), R(s1''); from each of these, the policy actions π(s1), π(s1'), π(s1'') lead on to states s2, s2', s2'' with their rewards, and so on.)
[Slide altered from Carlos Guestrin’s ML lecture]
Computing the value of a policy
Value function:
Vπ(s0) = E[ γ^0 R(s0, π(s0)) + γ^1 R(s1, π(s1)) + γ^2 R(s2, π(s2)) + γ^3 R(s3, π(s3)) + ⋯ ]
= E[ R(s0, π(s0)) + γ Σ_{t=1..∞} γ^(t−1) R(st, π(st)) ]
= R(s0, π(s0)) + γ E[ Σ_{t=1..∞} γ^(t−1) R(st, π(st)) ]
= R(s0, π(s0)) + γ Σ_{s'} P(s' | s0, π(s0)) Vπ(s')
where s0 is the current state and s' ranges over the possible next states.
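The last line is a fixed-point equation, so Vπ can be computed by iterating it. A minimal policy-evaluation sketch on the same made-up two-state MDP as before (the numbers and the fixed policy are illustrative only):

```python
import numpy as np

M = np.array([[[0.9, 0.1], [0.2, 0.8]],   # M[a, s, s'] = P(s' | s, a)   (made-up numbers)
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[0.0, 1.0],                 # R[s, a]
              [2.0, 0.0]])
gamma = 0.9
pi = np.array([1, 0])                     # a fixed policy: take action pi[s] in state s

V = np.zeros(2)
for _ in range(500):                      # iterate V(s) <- R(s, pi(s)) + gamma * sum_s' P(s'|s, pi(s)) V(s')
    V = np.array([R[s, pi[s]] + gamma * M[pi[s], s] @ V for s in range(2)])
print(V)                                  # value of the policy pi in each state
```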
Optimality — Bellman Equation
 The Bellman equation¹ for an MDP is a recursive definition of the optimal value function V*(·)
state-value function:
V*(s) = max_a [ R(s, a) + γ Σ_{s'} Ma(s, s') V*(s') ]
 Optimal Policy:
π*(s) = argmax_a [ R(s, a) + γ Σ_{s'} Ma(s, s') V*(s') ]
¹ R. Bellman, ‘57
Optimality — Bellman Equation
 The Bellman equation can be rewritten in terms of the action-value function Q, giving the relationship between V and Q:
V*(s) = max_a Q(s, a)
Q(s, a) = R(s, a) + γ Σ_{s'} Ma(s, s') V*(s')
 Optimal Policy:
π*(s) = argmax_a Q(s, a)
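Iterating this backup to a fixed point is exactly value iteration, one of the algorithms named on the next slide. A minimal sketch on the same made-up two-state MDP:

```python
import numpy as np

M = np.array([[[0.9, 0.1], [0.2, 0.8]],    # M[a, s, s'] = P(s' | s, a)   (made-up numbers)
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[0.0, 1.0],                  # R[s, a]
              [2.0, 0.0]])
gamma = 0.9

V = np.zeros(2)
for _ in range(500):
    Q = R + gamma * np.einsum('asn,n->sa', M, V)   # Q(s,a) = R(s,a) + γ Σ_s' Ma(s,s') V(s')
    V = Q.max(axis=1)                              # V*(s)  = max_a Q(s,a)

pi_star = Q.argmax(axis=1)                         # π*(s) = argmax_a Q(s,a)
print(V, pi_star)
```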
MDP algorithms
Solving the Bellman equation yields the optimal value V*(s) and the optimal policy π*(s).
• Model-based approaches:
– Value Iteration
– Policy Iteration
– Modified Policy Iteration
– Prioritized Sweeping
• Model-free approaches:
– Temporal Difference (TD) Learning
– Q-Learning
[Bellman, ’57; Howard, ‘60; Puterman and Shin, ‘78; Singh & Sutton, ‘96; Sutton & Barto, ‘98; Richard Sutton, ‘88; Watkins, ‘92]
[Slide altered from Carlos Guestrin’s ML lecture]
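As one example from the model-free column, a minimal tabular Q-learning sketch: it learns Q(s, a) from sampled transitions only, and uses the transition matrix solely to simulate the environment. The toy MDP numbers are made up, as in the earlier sketches.

```python
import numpy as np

rng = np.random.default_rng(0)
M = np.array([[[0.9, 0.1], [0.2, 0.8]],     # environment dynamics (used only for simulation)
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[0.0, 1.0],
              [2.0, 0.0]])
gamma, alpha, eps = 0.9, 0.1, 0.1           # discount, learning rate, exploration rate

Q = np.zeros((2, 2))                        # Q[s, a], learned from experience
s = 0
for _ in range(20000):
    a = rng.integers(2) if rng.random() < eps else int(Q[s].argmax())  # ε-greedy action
    s_next = rng.choice(2, p=M[a, s])       # environment samples the next state
    r = R[s, a]                             # observed immediate reward
    # Q-learning update: move Q(s,a) toward r + γ max_a' Q(s',a')
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
    s = s_next

print(Q, Q.argmax(axis=1))                  # learned action values and greedy policy
```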