Monte Carlo Search

Report
Monte Carlo Tree Search
A Tutorial
Supported by the
Engineering and
Physical Sciences
Research Council
(EPSRC)
Peter Cowling
University of York, UK
[email protected]
Simon Lucas
University of Essex, UK
[email protected]
Further Reading
Browne C., Powley E., Whitehouse D., Lucas S.,
Cowling P.I., Rohlfshagen P., Tavener S., Perez D. and
Samothrakis S., Colton S. (2012): "A Survey of Monte
Carlo Tree Search Methods" IEEE Transactions on
Computational Intelligence and AI in Games,
IEEE, 4 (1): 1-43.
Monte Carlo Tree Search (MCTS)
Why know more?
Go:
Strong/World Champion MCTS players in: General Game Playing,
Hex, Amazons, Arimaa, Khet, Shogi, Mancala, Morpion, Blokus,
Focus, Chinese Checkers, Yavalath, Connect Four, Gomoku
Over 250 papers since 2006 - about one per week on average.
Promise for planning, real-time games, nondeterministic games,
single-player puzzles, optimisation, simulation, PCG, …
Tutorial Structure
• Monte Carlo search
– The multi-armed bandit problem
• Monte Carlo Tree Search (MCTS)
– Selection, expansion, simulation, backpropagation
•
•
•
•
Enhancements to the basic algorithm
Handling uncertainty in MCTS
MCTS for real-time games
Conclusions, future research directions
Monte Carlo Search for Decision
Making in Games
No trees yet …
The Multi-Armed Bandit Problem
At each step pull one arm
Noisy/random reward signal
In order to:
Find the best arm
Minimise regret
Maximise expected return
Which Arm to Pull?
Flat Monte Carlo
Share trials uniformly
between arms
ε-Greedy
P(1- ε) – Best arm so far
P(ε) – Random arm
UCB1 (Auer et al (2002)).
Choose arm j so as to
maximise:
Mean
so far
Upper bound
on variance
Game Decisions
Current
position
Move A
Move B
Multi-Armed
Bandit
Move C
Position
after move
A
Position
after move
B
Position
after move
C
Arm A
Arm B
Arm C
Simulation Result
(+1 = win, 0 = loss)
Discussion
• Anytime – stop whenever you like
• UCB1 formula minimises regret
– Grows like log(n)
• Needs only game rules:
– Move generation
– Terminal state evaluation
• Surprisingly effective, but…
… doesn’t look ahead
MCTS / UCT Basics
Attractive Features
• Anytime
• Scalable
– Tackle complex games better than before
– May be logarithmically better with increased CPU
• No need for heuristic function
– Though usually better with one
• Next we’ll look at:
– General MCTS
– UCT in particular
MCTS: the main idea
• Tree policy: choose which node to expand (not necessarily leaf of
tree)
• Default (simulation) policy: random playout until end of game
MCTS Algorithm
• Decompose into 6 parts:
• MCTS main algorithm
– Tree policy
• Expand
• Best Child (UCT Formula)
– Default Policy
– Back-propagate
• We’ll run through these then show demos
MCTS Main Algorithm
•
•
BestChild simply picks best child node of root according to some criteria: e.g. best
mean value
In our pseudo-code BestChild is called from TreePolicy and from MctsSearch, but
different versions can be used
– E.g. final selection can be the max value child or the most frequently visited one
TreePolicy
• Note that node selected for expansion does not
need to be a leaf of the tree
– The nonterminal test refers to the game state
Expand
• Note: need to choose an appropriate data
structure for the children (array, ArrayList, Map)
Best Child (UCT)
• This is the standard UCT equation
– Used in the tree
• Higher values of c lead to more exploration
• Other terms can be added, and usually are
– More on this later
DefaultPolicy
• Each time a new node is added to the tree, the default policy
randomly rolls out from the current state until a terminal state of
the game is reached
• The standard is to do this uniformly randomly
– But better performance may be obtained by biasing with knowledge
Backup
• Note that v is the new node added to the tree by the tree
policy
• Back up the values from the added node up the tree to the
root
MCTS Builds Asymmetric Trees (demo)
Connect 4 Demo
Enhancements to the Basic MCTS
Algorithm
Tree Policy
Default Policy
Domain: independent or specific
Selection/Expansion Enhancements
(just a sample …)
• AMAF / RAVE
• First Play Urgency
• Learning
– E.g. Temporal Difference Learning to bias tree
policy or default policy
• Bandit enhancements
– E.g. Bayesian selection
• Parameter Tuning
All Moves As First (AMAF),
Rapid Value Action Estimates (RAVE)
• Additional term in UCT equation:
– Treat actions / moves the same independently of
where they occur in the move sequence
Simulation Enhancements
• Nudge the simulation to be more realistic
– Without introducing bias
– At very low CPU cost
• Examples:
–
–
–
–
Move-Average Sampling Technique (MAST)
Last Good Reply (LGR)
Patterns e.g. Bridge Completion:
Heuristic value functions
• Paradoxically better simulation policy sometimes
yields weaker play
Hex Demo
Parallelisation
Handling Uncertainty and
Incomplete Information in MCTS
Probabilistic (stochastic) nodes
CHANCE
MAX
MIN
MAX
Probabilistic (stochastic)
planning
Decision-making/optimisation
under uncertainty
Hidden Information
Information set:
Actual state:
Observation:
{
,
,
, ...}
E.g. Poker,
Bridge,
Security
Effects of Uncertainty and Hidden Information on
the Game Tree
4 possible
plays by me
choose
subset
50 possible random
card draws
40C3 =
9880 different
opponent plays
...
Reduced Branching Through Determinization
4 possible
plays by me
1 possible random
card draws
4C3 =
4 different
opponent plays
Use average results over
many determinizations
...
[Yoon, Fern & Givan 2007]
Determinization
•
•
•
•
Sample states from the information set
Analyse the individual perfect information games
Combine the results at the end
Successes
– Bridge (Ginsberg), Scrabble (Sheppard)
– Klondike solitaire (Bjarnason, Fern and Tadepalli)
– Probabilistic planning (Yoon, Fern and Givan)
• Problems
– Never tries to gather or hide information
– Suffers from strategy fusion and non-locality (Frank and
Basin)
M.L. Ginsberg, “GIB: Imperfect information in a computationally challenging game,” Journal of Artificial Intelligence Research, vol. 14, 2002, p. 313–368.
R. Bjarnason, A. Fern, and P. Tadepalli, “Lower bounding Klondike solitaire with Monte-Carlo planning,” Proc. ICAPS-2009, p. 26–33.
S. Yoon, A. Fern, and R. Givan, “FF-Replan: A Baseline for Probabilistic Planning,” 1Proc. ICAPS-2007, p. 352–359.
I. Frank and D. Basin, “Search in games with incomplete information: a case study using Bridge card play,” Artificial Intelligence, vol. 100, 1998, pp. 87-123.
33
Cheating
• Easiest approach to AI for games with
imperfect information: cheat and look
at the hidden information and
outcomes of future chance events
• This gives a deterministic game of
perfect information, which can be
searched with standard techniques
(minimax, UCT, …)
• Obviously not a reasonable approach in
practice, but gives a useful bound on
performance
• Much used in commercial games
Information Set Monte Carlo Tree Search (ISMCTS)
Maintain a single tree
Use each determinization
only once
Collect information about the
value of a decision in the
corresponding information set.
Determinizations
restrict search to a
subtree of the
information set tree.
Lord of the Rings: The Confrontation
• Board game designed by Reiner Knizia
• Hidden information: can’t see the identities of
opponent’s pieces
• Simultaneous moves: combat is resolved by both
players simultaneously choosing a card
Also Phantom 4,4,4,
Dou Di Zhu and several
other card games…
MCTS for Real-Time Games
• Limited roll-out budget
– Heuristic knowledge becomes important
• Action space may be too fine-grained
– Take Macro-actions
– Otherwise planning will be very short-term
• May be no terminal node in sight
– Again, use heuristic
– Tune simulation depth
• Next-state function may be expensive
– Consider making a simpler abstraction
PTSP Demo
Summary
• MCTS: exciting area of research
• Many impressive achievements already
– With many more to come
• Some interesting directions:
– Applications beyond games
– Comparisons with other search methods
– Heuristic knowledge is important – more work needed on
learning it
– Efficient parameter tuning
– POMDPs
– Uncertainty
– Macro-actions
– Hybridisation with other CI methods e.g. Evolution
Questions?
Browne C., Powley E., Whitehouse D., Lucas S., Cowling P.I., Rohlfshagen
P., Tavener S., Perez D. and Samothrakis S., Colton S. (2012): "A Survey of
Monte Carlo Tree Search Methods" IEEE Transactions on Computational
Intelligence and AI in Games, IEEE, 4 (1): 1-43.
Advertising Break
• Sep 28: Essex workshop on AI and Games
– Strong MCTS Theme
– Organised by Essex Game Intelligence Group
• Next week: AIGAMEDEV Vienna AI summit
– Find out what games industry people think
– Resistance competition and MCTS tutorial (!)

similar documents