Report

Monte Carlo Tree Search A Tutorial Supported by the Engineering and Physical Sciences Research Council (EPSRC) Peter Cowling University of York, UK [email protected] Simon Lucas University of Essex, UK [email protected] Further Reading Browne C., Powley E., Whitehouse D., Lucas S., Cowling P.I., Rohlfshagen P., Tavener S., Perez D. and Samothrakis S., Colton S. (2012): "A Survey of Monte Carlo Tree Search Methods" IEEE Transactions on Computational Intelligence and AI in Games, IEEE, 4 (1): 1-43. Monte Carlo Tree Search (MCTS) Why know more? Go: Strong/World Champion MCTS players in: General Game Playing, Hex, Amazons, Arimaa, Khet, Shogi, Mancala, Morpion, Blokus, Focus, Chinese Checkers, Yavalath, Connect Four, Gomoku Over 250 papers since 2006 - about one per week on average. Promise for planning, real-time games, nondeterministic games, single-player puzzles, optimisation, simulation, PCG, … Tutorial Structure • Monte Carlo search – The multi-armed bandit problem • Monte Carlo Tree Search (MCTS) – Selection, expansion, simulation, backpropagation • • • • Enhancements to the basic algorithm Handling uncertainty in MCTS MCTS for real-time games Conclusions, future research directions Monte Carlo Search for Decision Making in Games No trees yet … The Multi-Armed Bandit Problem At each step pull one arm Noisy/random reward signal In order to: Find the best arm Minimise regret Maximise expected return Which Arm to Pull? Flat Monte Carlo Share trials uniformly between arms ε-Greedy P(1- ε) – Best arm so far P(ε) – Random arm UCB1 (Auer et al (2002)). Choose arm j so as to maximise: Mean so far Upper bound on variance Game Decisions Current position Move A Move B Multi-Armed Bandit Move C Position after move A Position after move B Position after move C Arm A Arm B Arm C Simulation Result (+1 = win, 0 = loss) Discussion • Anytime – stop whenever you like • UCB1 formula minimises regret – Grows like log(n) • Needs only game rules: – Move generation – Terminal state evaluation • Surprisingly effective, but… … doesn’t look ahead MCTS / UCT Basics Attractive Features • Anytime • Scalable – Tackle complex games better than before – May be logarithmically better with increased CPU • No need for heuristic function – Though usually better with one • Next we’ll look at: – General MCTS – UCT in particular MCTS: the main idea • Tree policy: choose which node to expand (not necessarily leaf of tree) • Default (simulation) policy: random playout until end of game MCTS Algorithm • Decompose into 6 parts: • MCTS main algorithm – Tree policy • Expand • Best Child (UCT Formula) – Default Policy – Back-propagate • We’ll run through these then show demos MCTS Main Algorithm • • BestChild simply picks best child node of root according to some criteria: e.g. best mean value In our pseudo-code BestChild is called from TreePolicy and from MctsSearch, but different versions can be used – E.g. final selection can be the max value child or the most frequently visited one TreePolicy • Note that node selected for expansion does not need to be a leaf of the tree – The nonterminal test refers to the game state Expand • Note: need to choose an appropriate data structure for the children (array, ArrayList, Map) Best Child (UCT) • This is the standard UCT equation – Used in the tree • Higher values of c lead to more exploration • Other terms can be added, and usually are – More on this later DefaultPolicy • Each time a new node is added to the tree, the default policy randomly rolls out from the current state until a terminal state of the game is reached • The standard is to do this uniformly randomly – But better performance may be obtained by biasing with knowledge Backup • Note that v is the new node added to the tree by the tree policy • Back up the values from the added node up the tree to the root MCTS Builds Asymmetric Trees (demo) Connect 4 Demo Enhancements to the Basic MCTS Algorithm Tree Policy Default Policy Domain: independent or specific Selection/Expansion Enhancements (just a sample …) • AMAF / RAVE • First Play Urgency • Learning – E.g. Temporal Difference Learning to bias tree policy or default policy • Bandit enhancements – E.g. Bayesian selection • Parameter Tuning All Moves As First (AMAF), Rapid Value Action Estimates (RAVE) • Additional term in UCT equation: – Treat actions / moves the same independently of where they occur in the move sequence Simulation Enhancements • Nudge the simulation to be more realistic – Without introducing bias – At very low CPU cost • Examples: – – – – Move-Average Sampling Technique (MAST) Last Good Reply (LGR) Patterns e.g. Bridge Completion: Heuristic value functions • Paradoxically better simulation policy sometimes yields weaker play Hex Demo Parallelisation Handling Uncertainty and Incomplete Information in MCTS Probabilistic (stochastic) nodes CHANCE MAX MIN MAX Probabilistic (stochastic) planning Decision-making/optimisation under uncertainty Hidden Information Information set: Actual state: Observation: { , , , ...} E.g. Poker, Bridge, Security Effects of Uncertainty and Hidden Information on the Game Tree 4 possible plays by me choose subset 50 possible random card draws 40C3 = 9880 different opponent plays ... Reduced Branching Through Determinization 4 possible plays by me 1 possible random card draws 4C3 = 4 different opponent plays Use average results over many determinizations ... [Yoon, Fern & Givan 2007] Determinization • • • • Sample states from the information set Analyse the individual perfect information games Combine the results at the end Successes – Bridge (Ginsberg), Scrabble (Sheppard) – Klondike solitaire (Bjarnason, Fern and Tadepalli) – Probabilistic planning (Yoon, Fern and Givan) • Problems – Never tries to gather or hide information – Suffers from strategy fusion and non-locality (Frank and Basin) M.L. Ginsberg, “GIB: Imperfect information in a computationally challenging game,” Journal of Artificial Intelligence Research, vol. 14, 2002, p. 313–368. R. Bjarnason, A. Fern, and P. Tadepalli, “Lower bounding Klondike solitaire with Monte-Carlo planning,” Proc. ICAPS-2009, p. 26–33. S. Yoon, A. Fern, and R. Givan, “FF-Replan: A Baseline for Probabilistic Planning,” 1Proc. ICAPS-2007, p. 352–359. I. Frank and D. Basin, “Search in games with incomplete information: a case study using Bridge card play,” Artificial Intelligence, vol. 100, 1998, pp. 87-123. 33 Cheating • Easiest approach to AI for games with imperfect information: cheat and look at the hidden information and outcomes of future chance events • This gives a deterministic game of perfect information, which can be searched with standard techniques (minimax, UCT, …) • Obviously not a reasonable approach in practice, but gives a useful bound on performance • Much used in commercial games Information Set Monte Carlo Tree Search (ISMCTS) Maintain a single tree Use each determinization only once Collect information about the value of a decision in the corresponding information set. Determinizations restrict search to a subtree of the information set tree. Lord of the Rings: The Confrontation • Board game designed by Reiner Knizia • Hidden information: can’t see the identities of opponent’s pieces • Simultaneous moves: combat is resolved by both players simultaneously choosing a card Also Phantom 4,4,4, Dou Di Zhu and several other card games… MCTS for Real-Time Games • Limited roll-out budget – Heuristic knowledge becomes important • Action space may be too fine-grained – Take Macro-actions – Otherwise planning will be very short-term • May be no terminal node in sight – Again, use heuristic – Tune simulation depth • Next-state function may be expensive – Consider making a simpler abstraction PTSP Demo Summary • MCTS: exciting area of research • Many impressive achievements already – With many more to come • Some interesting directions: – Applications beyond games – Comparisons with other search methods – Heuristic knowledge is important – more work needed on learning it – Efficient parameter tuning – POMDPs – Uncertainty – Macro-actions – Hybridisation with other CI methods e.g. Evolution Questions? Browne C., Powley E., Whitehouse D., Lucas S., Cowling P.I., Rohlfshagen P., Tavener S., Perez D. and Samothrakis S., Colton S. (2012): "A Survey of Monte Carlo Tree Search Methods" IEEE Transactions on Computational Intelligence and AI in Games, IEEE, 4 (1): 1-43. Advertising Break • Sep 28: Essex workshop on AI and Games – Strong MCTS Theme – Organised by Essex Game Intelligence Group • Next week: AIGAMEDEV Vienna AI summit – Find out what games industry people think – Resistance competition and MCTS tutorial (!)