Report

POMDP slides based on Hansen et al.'s tutorial + R&N 3rd Ed. Sec. 17.4

Planning using Partially Observable Markov Decision Processes: A Tutorial
Presenters: Eric Hansen, Mississippi State University; Daniel Bernstein, University of Massachusetts/Amherst; Zhengzhu Feng, University of Massachusetts/Amherst; Rong Zhou, Mississippi State University

Introduction and foundations
- Definition of POMDP
- Goals, rewards and optimality criteria
- Examples and applications
- Computational complexity
- Belief states and Bayesian conditioning

Planning under partial observability
[Figure: an agent pursues a goal in an environment through actions, receiving only imperfect observations.]

Two approaches to planning under partial observability
- Nondeterministic planning: uncertainty is represented by a set of possible states; no possibility is considered more likely than any other.
- Probabilistic (decision-theoretic) planning: uncertainty is represented by a probability distribution over possible states.
In this tutorial we consider the second, more general approach.

Markov models
               Fully observable              Partially observable
Prediction     Markov chain                  Hidden Markov model
Planning       MDP (Markov decision          POMDP (Partially observable
               process)                      Markov decision process)

Definition of POMDP
[Figure: influence diagram with hidden states s0, s1, s2; observations z0, z1, z2; actions a0, a1, a2; rewards r0, r1, r2.]

Goals, rewards and optimality criteria
Rewards are additive and time-separable, and the objective is to maximize expected total reward. Traditional planning goals can be encoded in the reward function. Example: achieving a state satisfying property P at minimal cost is encoded by making any state satisfying P a zero-reward absorbing state, and assigning all other states negative reward.
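The goal-to-reward encoding described above can be sketched in a few lines; this is a minimal illustration, and the names `goal_reward`, `satisfies_p` and `step_cost` are hypothetical, not from the tutorial:

```python
def goal_reward(satisfies_p, step_cost=1.0):
    """Encode "reach a state satisfying P at minimal cost" as a reward
    function: states satisfying P get zero reward (and would also be made
    absorbing in the transition model); every other state pays a negative
    step cost for any action."""
    def reward(state, action):
        return 0.0 if satisfies_p(state) else -step_cost
    return reward
```

Maximizing total reward under this encoding is then equivalent to reaching a P-state at minimal expected cost.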
POMDPs allow partial satisfaction of goals and tradeoffs among competing goals. The planning horizon can be finite, infinite or indefinite.

Machine maintenance
The canonical application of POMDPs in Operations Research.

Robot navigation
The canonical application of POMDPs in AI. Toy example from Russell & Norvig's AI textbook: a grid with a +1 terminal state, a -1 terminal state, and a Start cell. Actions: N, S, E, W, Stop; moves succeed with probability 0.8 and slip sideways with probability 0.1 each way. Observations: sense the surrounding walls.

Many other applications
- Helicopter control [Bagnell & Schneider 2001]
- Dialogue management [Roy, Pineau & Thrun 2000]
- Preference elicitation [Boutilier 2002]
- Optimal search and sensor scheduling [Krishnamurthy & Singh 2000]
- Medical diagnosis and treatment [Hauskrecht & Fraser 2000]
- Packet scheduling in computer networks [Chang et al. 2000; Bent & Van Hentenryck 2004]

Computational complexity
- Finite-horizon: PSPACE-hard [Papadimitriou & Tsitsiklis 1987]; NP-complete if unobservable
- Infinite-horizon: undecidable [Madani, Hanks & Condon 1999]; NP-hard for ε-approximation [Lusena, Goldsmith & Mundhenk 2001]; NP-hard for the memoryless or bounded-memory control problem [Littman 1994; Meuleau et al. 1999]

POMDP: the <S, A, T, R, Ω, O> tuple
- S, A, T, R as in an MDP
- Ω: a finite set of observations
- O: S x A -> Π(Ω), the observation function
- Belief state (information state): b, a probability distribution over S, e.g. b(s1)
The agent acts in a world whose state is not directly observed; the goal is to maximize expected long-term reward from the initial state distribution.

Two sources of POMDP complexity
- Curse of dimensionality: size of the state space (shared with other planning problems)
- Curse of memory: size of the value function (number of vectors), or equivalently size of the controller (unique to POMDPs)
The complexity of each iteration of DP is on the order of |S|^2 |A| |V_{n-1}|^{|Z|}, where |V_{n-1}| is the number of vectors at the previous iteration.

Two representations of policy
A policy maps history to action. Since history grows exponentially with the horizon, it must be summarized, especially in the infinite-horizon case. Two ways to summarize history:
- belief state
- finite-state automaton, which partitions histories into a finite number of "states"

Belief simplex
With 2 states, the belief simplex is the segment between (1, 0) and (0, 1); with 3 states, it is the triangle with corners (1, 0, 0), (0, 1, 0) and (0, 0, 1).

Belief state has the Markov property
The process of maintaining the belief state is Markovian: for any belief state, the successor belief state depends only on the action and observation.

Belief-state MDP
- State space: the belief simplex
- Actions: same as before
- State transition function: P(b'|b,a) = Σ_z P(b'|b,a,z) P(z|b,a)
- Reward function: r(b,a) = Σ_{s∈S} b(s) r(s,a)
- Bellman optimality equation:
  V(b) = max_{a∈A} [ r(b,a) + Σ_{b'} P(b'|b,a) V(b') ]
  (strictly the sum over b' should be an integral over the continuous belief space, but for a given b and a only finitely many successor beliefs, one per observation, have nonzero probability)

Belief-state controller
[Figure: a state estimator computes the current belief state b (held in a register) from the previous belief, the action a and the observation z; the policy maps b to the next action.]
Update the belief state after each action and observation. The policy maps belief state to action, and is found by solving the belief-state MDP: a POMDP is an MDP in belief space.

Dynamic programming for POMDPs
Some important concepts: the belief state (e.g. b = (0.25, 0.40, 0.35) over s1, s2, s3); the policy tree (an action, then for each observation a subtree); and the linear value function of a policy tree over the belief simplex.
[Figures: a DP backup enumerates all one-step extensions of the current policy trees, then prunes dominated ones, keeping only the trees that are optimal somewhere in the belief simplex.]

POMDP value iteration: basic idea [finite-horizon case]
This solves the first problem. Key insight: the value function is piecewise linear and convex (PWLC). Convexity makes intuitive sense: in the middle of belief space entropy is high, the agent can't select actions appropriately, and long-term reward is low; near the corners of the simplex entropy is low, actions are more likely to be appropriate for the current world state, and reward is higher. Each line (hyperplane) is represented by its vector of coefficients, e.g. V(b) = c1·b(s1) + c2·(1-b(s1)). To find the value at b, find the vector with the largest dot product with b.

POMDP value iteration, phase 1: one-action plans
Two states, 0 and 1, with R(0) = 0 and R(1) = 1. Action stay: 0.9 stay, 0.1 go; action go: 0.9 go, 0.1 stay. The sensor reports the correct state with probability 0.6. Discount factor = 1.

POMDP value iteration, phase 2: two-action (conditional) plans
[Figure: conditional plans built from stay/go over the two-state example.]

Point-based value iteration: approximating with exemplar belief states

Solving infinite-horizon POMDPs
Value iteration (iteration of the dynamic programming operator) computes a value function that is arbitrarily close to optimal. The optimal value function is not necessarily piecewise linear, since optimal control may require infinite memory. But in many cases, as Sondik (1978) and Kaelbling et al. (1998) noticed, value iteration converges to a finite set of vectors; in these cases an optimal policy is equivalent to a finite-state controller.

Policy evaluation
As in the fully observable case, policy evaluation involves solving a system of linear equations. There is one unknown (and one equation) for each pair of system state and controller node.

Policy improvement
[Figure: controller nodes are added, merged or redirected according to the vectors produced by a DP backup.]

Per-iteration complexity of POMDP value iteration
Determined by the number of α-vectors needed at the t-th iteration and the time for computing each α-vector.

Approximating the POMDP value function with bounds
Approximate value functions can be obtained in two ways:
- Over-constrain it to a NOMDP (unobservable MDP): the "blind" value function ignores the observation and under-estimates value; a "conformant" policy over-estimates cost. For an infinite horizon it takes the same action always, so there are only |A| policies.
- Relax it to a FOMDP: assume the state is fully observable.
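The "largest dot product" rule for PWLC value functions can be sketched directly; this is a minimal illustration (the function name `value_at` is mine), using an α-vector per action built from immediate rewards:

```python
import numpy as np

def value_at(b, alphas):
    """Evaluate a PWLC value function: V(b) = max over α-vectors of α · b.
    Returns the value and the index of the maximizing vector, which
    identifies the plan (or action) to follow at b."""
    vals = [float(np.dot(alpha, b)) for alpha in alphas]
    best = int(np.argmax(vals))
    return vals[best], best
```

Near a corner of the simplex the maximizing vector is the one tailored to that state; in the middle, the max over hyperplanes produces the convex "bowl" shape described above.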
This over-estimates value; a "state-based" policy under-estimates cost. Per iteration, upper bounds for leaf nodes can come from FOMDP value iteration and lower bounds from NOMDP value iteration. (Observations are written as o or z interchangeably.)

Comparing POMDPs with nondeterministic conditional planning
[Figure: side-by-side comparison of the POMDP and nondeterministic cases.]
RTDP-Bel doesn't do lookahead, and also stores the current estimate of the value function (see its update).

---SLIDES BEYOND THIS NOT COVERED---

Two problems
- How to represent a value function over a continuous belief space?
- How to update value function Vt from Vt-1?

POMDP -> MDP
- S => B, the set of belief states
- A => same
- T => τ(b, a, b')
- R => ρ(b, a)

Running example
A POMDP with two states (s1 and s2), two actions (a1 and a2), and three observations (z1, z2, z3). The belief space for a 2-state POMDP is one-dimensional: the probability that the state is s1.

Second problem
We can't iterate over all belief states (there are infinitely many) for value iteration, but: given the vectors representing Vt-1, we can generate the vectors representing Vt.

Horizon 1
With no future, the value function consists only of immediate reward, e.g. R(s1,a1) = 0, R(s2,a1) = 1.5, R(s1,a2) = 1, R(s2,a2) = 0. For b = <0.25, 0.75>:
- value of doing a1 = 0 x b(s1) + 1.5 x b(s2) = 0 x 0.25 + 1.5 x 0.75 = 1.125
- value of doing a2 = 1 x b(s1) + 0 x b(s2) = 1 x 0.25 + 0 x 0.75 = 0.25

Second problem: break it down into 3 steps
- Compute the value of a belief state given action and observation
- Compute the value of a belief state given action
- Compute the value of a belief state

Horizon 2 - given action & observation
If in belief state b, what is the best value of doing action a1 and seeing z1? Best value = value of the immediate action + best value of the next action, where the value of the immediate action comes from the horizon-1 value function.

Horizon 2 - given action & observation
Assume the immediate action is a1 and the observation is z1. What's the best action for the belief b' that results from the initial b when we perform a1 and observe z1?
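The successor belief b' in the question above comes from Bayesian conditioning on the action and observation. A minimal NumPy sketch, where `T[a]` and `O[a]` are hypothetical dict-of-matrix containers (transition matrix T[a][s][s'] and observation matrix O[a][s'][z]); the numbers in the usage check come from the earlier two-state stay/go example with a sensor that is correct with probability 0.6:

```python
import numpy as np

def belief_update(b, a, z, T, O):
    """Bayesian belief update: b'(s') ∝ O[a][s'][z] · Σ_s T[a][s][s'] b(s)."""
    pred = b @ T[a]                 # predict: P(s' | b, a)
    unnorm = pred * O[a][:, z]      # condition on observation z
    norm = unnorm.sum()             # normalizer = P(z | b, a)
    if norm == 0.0:
        raise ValueError("observation has zero probability under (b, a)")
    return unnorm / norm
```

For example, from the uniform belief, doing stay and sensing state 0 shifts the belief to (0.6, 0.4).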
Doing this for every belief state is not feasible (there are infinitely many).

Horizon 2 - given action & observation
Instead, construct a function over the entire (initial) belief space from the horizon-1 value function, with the belief transformation built in. S(a1, z1) corresponds to the paper's S(), which has built in: the horizon-1 value function, the belief transformation, the "weight" of seeing z after performing a, the discount factor, and the immediate reward. S() is PWLC.

Second problem: the 3 steps again
- Compute the value of a belief state given action and observation
- Compute the value of a belief state given action
- Compute the value of a belief state

Horizon 2 - given action
What is the horizon-2 value of a belief state given that the immediate action is a1? At horizon 2, do action a1; at horizon 1, do action...?

Horizon 2 - given action
What's the best strategy at b? How do we compute the line (vector) representing the best strategy at b? (easy) How many strategies are there in the figure? What's the maximum number of strategies (after taking immediate action a1)?

Horizon 2 - given action
How can we represent the 4 regions (strategies) as a value function? Note: each region is a strategy. Sum up the vectors representing each region; a sum of vectors is a vector (add lines, get lines). This corresponds to the paper's transformation. What does each region represent? Why is this step hard (as alluded to in the paper)?

Second problem: the 3 steps again
- Compute the value of a belief state given action and observation
- Compute the value of a belief state given action
- Compute the value of a belief state

Horizon 2
Take the union over actions: a1 U a2. This tells you how to act! => Purge dominated vectors. Then use the horizon-2 value function to update horizon 3's, and so on.
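The "sum up vectors representing each region" step is a cross-sum over the per-observation sets. A naive sketch (the name `cross_sum` is mine) makes the combinatorics explicit:

```python
from itertools import product
import numpy as np

def cross_sum(vector_sets):
    """Naive cross-sum: for each way of picking one vector from each
    observation's set, sum the picks. Produces prod(|S_z|) candidate
    vectors, which is why this step needs pruning."""
    return [np.sum(choice, axis=0) for choice in product(*vector_sets)]
```

With two sets of sizes 2 and 3, this already yields 6 candidate vectors; the blow-up across all observations is the hard step that pruning methods attack.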
The hard step
It is easy to visually inspect the figure to obtain the different regions, but in higher-dimensional spaces, with many actions and observations, this is a hard problem. The naive way is to enumerate. How does Incremental Pruning do it?

Incremental Pruning
Combinations, then purge/filter. How does IP improve on the naive method? Will IP ever do worse than the naive method?

Incremental Pruning
What other novel idea is in IP? RR: come up with a smaller set D as the argument to Dominate(). RR solves more linear programs, but with fewer constraints in the worst case; empirically, the reduction in constraints saves more time than the extra linear programs cost. Why are the terms after U needed?

Identifying a witness
Witness theorem:
- Let Ua be a set of vectors representing the value function.
- Let u be in Ua (e.g. u = α_{z1,a2} + α_{z2,a1} + α_{z3,a1}).
- If there is a vector v which differs from u in one observation (e.g. v = α_{z1,a1} + α_{z2,a1} + α_{z3,a1}) and there is a b such that b·v > b·u,
- then Ua is not equal to the true value function.

Witness algorithm
Randomly choose a belief state b, compute the vector representing the best value at b (easy), and add it to the agenda. While the agenda is not empty:
- Get vector Vtop from the top of the agenda
- b' = Dominate(Vtop, Ua)
- If b' is not null (there is a witness): compute the vector u for the best value at b' and add it to Ua; compute all vectors v that differ from u at one observation and add them to the agenda

Linear support
If the value function is incorrect, the biggest difference is at the edges (by convexity).

Number of policy trees
At horizon T there are |A|^((|Z|^T - 1)/(|Z| - 1)) policy trees. Example for |A| = 4 and |Z| = 2:

Horizon            0    1    2     3        4
# of policy trees  1    4    64    16,384   1,073,741,824

Policy graph
[Figure: a 2-state example value function and the corresponding policy graph over actions a0, a1, a2, with transitions labeled by observations z0, z1.]

Policy iteration for POMDPs
Sondik's (1978) algorithm represents the policy as a mapping from belief states to actions; it only works under special assumptions, is very difficult to implement, and is never used. Hansen's (1998) algorithm represents the policy as a finite-state controller; it is fully general, easy to implement, and faster than value iteration.

Properties of policy iteration
- Theoretical: monotonically improves the finite-state controller; converges to an ε-optimal finite-state controller after a finite number of iterations
- Empirical: runs from 10 to over 100 times faster than value iteration

Scaling up
- State abstraction and factored representation
- Belief compression
- Forward search and sampling approaches
- Hierarchical task decomposition

State abstraction and factored representation of POMDPs
DP algorithms are typically state-based, while most AI representations are "feature-based". |S| is typically exponential in the number of features (or variables) - the "curse of dimensionality" - so state-based representations for problems with more than a few variables are impractical. Factored representations exploit regularities in the transition probabilities, observation probabilities, and reward.

Example: part-painting problem [Draper, Hanks & Weld 1994]
- Boolean state variables: flawed (FL), blemished (BL), painted (PA), processed (PR), notified (NO)
- Actions: Inspect, Paint, Ship, Reject, Notify
- Cost function: cost of 1 for each action; cost of 1 for shipping an unflawed part that is not painted; cost of 10 for shipping a flawed part or rejecting an unflawed part
- Initial belief state: Pr(FL) = 0.3, Pr(BL|FL) = 1.0, Pr(BL|~FL) = 0.0, Pr(PA) = 0.0, Pr(PR) = 0.0, Pr(NO) = 0.0

Factored representation of MDPs [Boutilier et al. 1995; Hoey, St-Aubin, Hu & Boutilier 1999]
[Figure: dynamic belief network over FL, BL, PA, SH, RE, NO with conditional probability tables for the next-step variables.]
A dynamic Bayesian network captures variable independence; an algebraic decision diagram captures value independence.

Decision diagrams
[Figure: a binary decision diagram (BDD) over X, Y, Z with TRUE/FALSE terminals, and an algebraic decision diagram (ADD) with numeric terminals such as 5.8, 3.6, 18.6, 9.5.]

Operations on decision diagrams
Addition (subtraction), multiplication (division), minimum (maximum), marginalization, expected value. The complexity of the operators depends on the size of the decision diagrams, not the number of states!

Symbolic dynamic programming for factored POMDPs [Hansen & Feng 2000]
Factored representation of the value function: replace |S|-vectors with ADDs that only make relevant state distinctions. The two steps of the DP algorithm are: generate new ADDs for the value function, and prune dominated ADDs. State abstraction is based on aggregating states with the same value.

Generation step: symbolic implementation
[Figure: new ADDs are assembled from the action's reward, transition-probability and observation-probability ADDs.]

Pruning step: symbolic implementation
Pruning is the most computationally expensive part of the algorithm: a linear program must be solved for each (potential) ADD in the value function. Because state abstraction reduces the dimensionality of the linear programs, it significantly improves efficiency.

Improved performance
[Table: on test problems 1-7, the degree of abstraction (number of abstract states / number of primitive states) ranges from 0.01 to 1.00, with corresponding speedup factors for the generate and prune steps.]

Optimal plan (controller) for the part-painting problem
[Figure: finite-state controller alternating Inspect, Paint, Ship, Reject and Notify according to the observed OK/~OK outcomes.]

Approximate state aggregation
Simplify each ADD in the value function by merging leaves that differ in value by less than δ (e.g. δ = 0.4).

Approximate pruning
Prune vectors from the value function that add less than δ to the value of any belief state.

Error bound
These two methods of approximation share the same error bound: "weak convergence", i.e., convergence to within 2δ/(1-γ) of optimal (where γ is the discount factor). After "weak convergence", decreasing δ allows further improvement. Starting with a relatively high δ and gradually decreasing it accelerates convergence.

Approximate dynamic programming
Strategy: ignore differences of value less than some threshold. The two complementary methods - approximate state aggregation and approximate pruning - address the two sources of complexity: the size of the state space and the size of the value function (memory).

Belief compression
Reduce the dimensionality of belief space by approximating the belief state. Examples of approximate belief states: the tuple of most-likely state plus the entropy of the belief state [Roy & Thrun 1999]; belief features learned by exponential-family Principal Components Analysis [Roy & Gordon 2003]. Standard POMDP algorithms can then be applied in the lower-dimensional belief space, e.g., grid-based approximation.

Forward search
[Figure: an AND/OR tree rooted at the current belief, branching on actions a0, a1 and observations z0, z1.]

Sparse sampling
Forward search can be combined with Monte Carlo sampling of possible observations and action outcomes [Kearns et al. 2000; Ng & Jordan 2000]. Remarkably, the complexity is independent of the size of the state space!
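A minimal sketch of sparse-sampling forward search in the spirit of Kearns et al.; the hooks `sample_next(b, a)` (sample a successor belief) and `reward(b, a)` are hypothetical problem-specific callbacks, not part of the published algorithm's interface:

```python
def sparse_sample_value(b, depth, width, actions, sample_next, reward,
                        gamma=0.95):
    """Estimate V(b) by depth-limited lookahead, sampling `width`
    successor beliefs per action instead of enumerating them. The cost
    depends on (|A| * width) ** depth, not on the size of the state space."""
    if depth == 0:
        return 0.0
    best = float("-inf")
    for a in actions:
        total = reward(b, a)
        for _ in range(width):
            b2 = sample_next(b, a)
            total += gamma * sparse_sample_value(
                b2, depth - 1, width, actions, sample_next, reward, gamma
            ) / width
        best = max(best, total)
    return best
```

An on-line planner would run this from the current belief and execute the maximizing action, then replan after the next observation.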
The on-line planner selects an ε-optimal action for the current belief state.

State-space decomposition
For some POMDPs, each action/observation pair identifies a specific region of the state space. Motivating example, continued: a "deterministic observation" reveals that the world is in one of a small number of possible states. The same holds for "hybrid POMDPs", which are POMDPs with some fully observable and some partially observable state variables.

Region-based dynamic programming
[Figure: the belief simplex as a tetrahedron, with the surfaces corresponding to regions.]

Hierarchical task decomposition
We have considered abstraction in state space; now we consider abstraction in action space. For fully observable MDPs:
- Options [Sutton, Precup & Singh 1999]
- HAMs [Parr & Russell 1997]
- Region-based decomposition [Hauskrecht et al. 1998]
- MAXQ [Dietterich 2000]
The hierarchical approach may cause sub-optimality, but limited forms of optimality can be guaranteed: hierarchical optimality (Parr & Russell) and recursive optimality (Dietterich).

Hierarchical approaches to POMDPs
- Theocharous & Mahadevan (2002): based on the hierarchical hidden Markov model; approximation; ~1000-state robot hallway-navigation problem
- Pineau et al. (2003): based on Dietterich's MAXQ decomposition; approximation; ~1000-state robot navigation and dialogue
- Hansen & Zhou (2003): also based on Dietterich's MAXQ decomposition; convergence guarantees and epsilon-optimality

Macro action as finite-state controller
[Figure: a "go East" macro as a small controller whose nodes issue East, North or South moves depending on observed walls, with a Stop node at the goal.]
This allows exact modeling of a macro's effects: macro state transition probabilities and macro rewards.

Taxi example [Dietterich 2000]
Task hierarchy: Taxi -> Get, Put; Get -> Pickup, Navigate; Put -> Putdown, Navigate; Navigate -> North, South, East, West.

Hierarchical finite-state controller
[Figure: the Get and Put sub-controllers invoke Navigate sub-controllers built from North/South/East/West and Stop nodes.]

MAXQ-hierarchical policy iteration
Create an initial sub-controller for each sub-POMDP in the hierarchy. Repeat until the error bound is less than ε:
- Identify the subtask that contributes most to the overall error
- Use policy iteration to improve the corresponding controller
- For each node of the controller, create an abstract action (for the parent task) and compute its model
- Propagate the error up through the hierarchy
The resulting controller has a modular structure.

Complexity reduction
The per-iteration complexity of policy iteration is on the order of |A| |Q|^|Z|, where A is the set of actions, Z the set of observations, and Q the set of controller nodes. The per-iteration complexity of hierarchical policy iteration is Σ_i |A| |Q_i|^|Z|, where |Q| = Σ_i |Q_i|. With hierarchical decomposition, the complexity is the sum of the complexities of the subproblems, instead of the product.

Scalability
MAXQ-hierarchical policy iteration can solve any POMDP, if it can decompose it into sub-POMDPs that can be solved by policy iteration. Although each sub-controller is limited in size, the hierarchical controller is not limited in size; although the (abstract) state space of each subtask is limited in size, the total state space is not limited in size.

Multi-agent planning with POMDPs
- Partially observable stochastic games
- Generalized dynamic programming
[Figure: two agents act on a shared world; agent i emits action a_i and receives observation z_i and reward r_i.]
Many planning problems involve multiple agents acting in a partially observable environment. The POMDP framework can be extended to address this.

Partially observable stochastic game (POSG)
A POSG is <S, A1, A2, Z1, Z2, P, r1, r2>, where:
- S is a finite state set, with initial state s0
- A1, A2 are finite action sets
- Z1, Z2 are finite observation sets
- P(s'|s, a1, a2) is the state transition function
- P(z1, z2|s, a1, a2) is the observation function
- r1(s, a1, a2) and r2(s, a1, a2) are reward functions
Special cases: all agents share the same reward function; zero-sum games.

Plans and policies
A local policy is a mapping π_i : Z_i* -> A_i. A joint policy is a pair <π1, π2>.
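A local policy over length-T observation histories corresponds to a depth-T policy tree, and the number of such trees grows doubly exponentially in T. A small sketch (the function name is mine; it assumes at least two observations), which reproduces the horizon-vs-count table given earlier for |A| = 4, |Z| = 2:

```python
def num_policy_trees(num_actions, num_obs, horizon):
    """Count deterministic depth-T policy trees: each of the
    (|Z|**T - 1)/(|Z| - 1) nodes is labeled with one of |A| actions.
    Assumes num_obs >= 2."""
    nodes = (num_obs**horizon - 1) // (num_obs - 1)
    return num_actions**nodes
```

Already at horizon 4 this gives over 10^9 trees per agent, which is a concrete sense in which brute-force conversion of a POSG to normal form is hopeless.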
Each agent wants to maximize its own long-term expected reward. Although execution is distributed, planning can be centralized.

Beliefs in POSGs
With a single agent, a belief is a distribution over states. How does this generalize to multiple agents? One could have beliefs over beliefs over beliefs, but there is no algorithm for working with these.

Example
States: grid cell pairs. Actions: moves in the four directions. Transitions: noisy. Goal: pick up the balls. Observations: the red lines.

Another example
States: who has a message to send? Actions: send or don't send. Reward: +1 for a successful broadcast, 0 if there is a collision or the channel is not used. Observations: was there a collision? (noisy)

Strategy elimination in POSGs
One could simply convert to normal form, but the number of strategies is doubly exponential in the horizon length.
[Figure: the normal-form payoff matrix, with a pair of payoffs R_ij1, R_ij2 for each pair of strategies.]

Generalized dynamic programming
- Initialize the 1-step policy trees to be the actions
- Repeat: evaluate all pairs of t-step trees from the current sets; iteratively prune dominated policy trees; form exhaustive sets of (t+1)-step trees from the remaining t-step trees

What generalized DP does
The algorithm performs iterated elimination of dominated strategies in the normal-form game without first writing it down. For cooperative POSGs, the final sets contain the optimal joint policy.

Some implementation issues
As before, pruning can be done using linear programming. The algorithm keeps the value function and policy trees in memory (unlike the POMDP case). There is currently no way to prune in an incremental fashion.

A better way to do elimination
We use dynamic programming to eliminate dominated strategies without first converting to normal form. Pruning a subtree eliminates the set of trees containing it.
[Figure: pruning one subtree of a policy tree eliminates every larger tree built on it.]

Dynamic programming
Build the policy tree sets for both agents simultaneously, and prune using a generalized belief space: agent 1's state space is the cross-product of the system states and agent 2's policy trees, and vice versa.
[Figures: successive DP steps enumerate both agents' one-step tree extensions, then alternately prune dominated trees from each agent's set.]

Complexity of POSGs
The cooperative finite-horizon case is NEXP-hard, even with two agents whose observations completely determine the state [Bernstein et al. 2002]. Implications: the problem is provably intractable (because P ≠ NEXP), and it probably requires doubly exponential time to solve in the worst case.