Report

The Reverend Thomas Bayes (1702-1761) Bayesian Inference Sudhir Shankar Raman Translational Neuromodeling Unit, UZH & ETH With many thanks for materials to: Klaas Enno Stephan & Kay H. Brodersen Why do I need to learn about Bayesian stats? Because SPM is getting more and more Bayesian: • Segmentation & spatial normalisation • Posterior probability maps (PPMs) • Dynamic Causal Modelling (DCM) • Bayesian Model Selection (BMS) • EEG: source reconstruction Bayesian Inference Model Selection Approximate Inference Variational Bayes MCMC Sampling Bayesian Inference Model Selection Approximate Inference Variational Bayes MCMC Sampling Classical and Bayesian statistics p-value: probability of getting the observed data in the effect’s absence. If small, reject null hypothesis that there is no effect. H0 : 0 One can never accept the null hypothesis p( y | H 0 ) Given enough data, one can always demonstrate a significant effect Probability of observing the data y, given no effect ( = 0). Correction for multiple comparisons necessary Bayesian Inference Flexibility in modelling p( y, ) Incorporating prior information p( ) Posterior probability of effect p( | y ) Options for model comparison Statistical analysis and the illusion of objectivity. James O. Berger, Donald A. Berry Bayes‘ Theorem Posterior Beliefs Observed Data Reverend Thomas Bayes 1702 - 1761 Prior Beliefs “Bayes‘ Theorem describes, how an ideally rational person processes information." Wikipedia Bayes’ Theorem Given data y and parameters , the conditional probabilities are: p( y, ) p( | y ) p( y) p( y , ) p( y | ) p( ) Eliminating p(y,) gives Bayes’ rule: Likelihood Posterior p( y | ) p( ) P( | y ) p( y) Evidence Prior Bayesian statistics new data p( y | ) prior knowledge p( ) p( | y) p( y | ) p( ) posterior likelihood ∙ prior Bayes theorem allows one to formally incorporate prior knowledge into computing statistical probabilities. Priors can be of different sorts: empirical, principled or shrinkage priors, uninformative. The “posterior” probability of the parameters given the data is an optimal combination of prior knowledge and new data, weighted by their relative precision. Bayes in motion - an animation Principles of Bayesian inference Formulation of a generative model Likelihood function p(y|) Model prior distribution p() Observation of data Data y Model Inversion - Update of beliefs based upon observations, given a prior state of knowledge p( | y) p( y | ) p( ) Maximum a posteriori (MAP) Maximum likelihood (ML) Conjugate Priors Prior and Posterior have the same form p( y | ) p( ) p( | y ) p( y) Same form !! Analytical expression. Conjugate priors for all exponential family members. Example – Gaussian Likelihood , Gaussian prior over mean Gaussian Model Likelihood & prior p( y | ) N ( y | , e 1 ) y p( ) N ( | p , p1 ) Posterior: p( | y ) N ( | , ) 1 e p e p y p Relative precision weighting Posterior y Likelihood p Prior Bayesian regression: univariate case Normal densities p( ) N ( | p , p2 ) y x Univariate linear model | y p( y | ) N ( y | x , e2 ) y x p( | y ) N ( | | y , 2| y ) p 1 2| y | y x2 1 2 2 e p x 1 2 y 2 p p e 2 |y Relative precision weighting Bayesian GLM: multivariate case Normal densities General Linear Model p(θ) N (θ; η p , C p ) y Xθ e p(θ | y) N (θ; η | y , C | y ) 1 1 C | y XT Ce X C p 2 p(y | θ) N (y; Xθ, Ce ) 1 1 η | y C | y XT Ce y C p η p One step if Ce is known. Otherwise define conjugate prior or perform iterative estimation with EM. 1 An intuitive example 10 2 5 0 -5 Prior Likelihood Posterior -10 -10 -5 0 1 5 10 Bayesian Inference Model Selection Approximate Inference Variational Bayes MCMC Sampling Bayesian model selection (BMS) Given competing hypotheses on structure & functional mechanisms of a system, which model is the best? Which model represents the best balance between model fit and model complexity? For which model m does p(y|m) become maximal? Pitt & Miyung (2002), TICS Bayesian model selection (BMS) Bayes’ rule: p( | y, m) p( y | , m) p( | m) p ( y | m) Model evidence: p( y | m) p( y | , m) p( | m) d accounts for both accuracy and complexity of the model allows for inference about structure (generalizability) of the model Model comparison via Bayes factor: 1 ) (|1 ) (1 ) = (2 |) (|2 ) (2 ) Model averaging | = , (|) Kass and Raftery (1995), Penny et al. (2004) NeuroImage = (|1 ) (|2 ) BF10 Evidence against H0 1 to 3 Not worth more than a bare mention 3 to 20 Positive 20 to 150 Strong > 150 Decisive Bayesian model selection (BMS) Various Approximations: • Akaike Information Criterion (AIC) – Akaike, 1974 ln ) − • Bayesian Information Criterion (BIC) – Schwarz, 1978 1 ln () ≅ ln ) − ln() 2 • Negative free energy ( F ) • A by-product of Variational Bayes Bayesian Inference Model Selection Approximate Inference Variational Bayes MCMC Sampling Approximate Bayesian inference Bayesian inference formalizes model inversion, the process of passing from a prior to a posterior in light of data. posterior = likelihood prior ∫ , d marginal likelihood (model evidence) In practice, evaluating the posterior is usually difficult because we cannot easily evaluate , especially when: • High dimensionality, complex form • analytical solutions are not available • numerical integration is too expensive Source: Kay H. Brodersen, 2013, http://people.inf.ethz.ch/bkay/talks/Brodersen_2013_03_22.pdf Approximate Bayesian inference There are two approaches to approximate inference. They have complementary strengths and weaknesses. Deterministic approximate inference Stochastic approximate inference in particular variational Bayes in particular sampling find an analytical proxy that is maximally similar to design an algorithm that draws samples 1 , … , from inspect distribution statistics of (e.g., mean, quantiles, intervals, …) often insightful and fast often hard work to derive converges to local minima inspect sample statistics (e.g., histogram, sample quantiles, …) asymptotically exact computationally expensive tricky engineering concerns Source: Kay H. Brodersen, 2013, http://people.inf.ethz.ch/bkay/talks/Brodersen_2013_03_22.pdf Bayesian Inference Model Selection Approximate Inference Variational Bayes MCMC Sampling The Laplace approximation The Laplace approximation provides a way of approximating a density whose normalization constant we cannot evaluate, by fitting a Normal distribution to its mode. = 1 Pierre-Simon Laplace (1749 – 1827) main part of the density (easy to evaluate) French mathematician and astronomer × normalization constant (unknown) This is exactly the situation we face in Bayesian inference: = 1 × model evidence (unknown) , joint density (easy to evaluate) Source: Kay H. Brodersen, 2013, http://people.inf.ethz.ch/bkay/talks/Brodersen_2013_03_22.pdf Applying the Laplace approximation Given a model with parameters = 1 , … , , the Laplace approximation reduces to a simple three-step procedure: Find the mode of the log-joint: ∗ = arg max ln , Evaluate the curvature of the log-joint at the mode: We obtain a Gaussian approximation: , Λ−1 with = ∗ Λ = −∇∇ ln , ∗ ∇∇ ln , ∗ Source: Kay H. Brodersen, 2013, http://people.inf.ethz.ch/bkay/talks/Brodersen_2013_03_22.pdf Limitations of the Laplace approximation The Laplace approximation is often too strong a simplification. The TheLaplace Laplaceapproximation approximation The TheLaplace Laplaceapproximation approximation 0.25 0.25 0.25 0.25 0.2 0.2 0.2 0.2 ignores global properties of the posterior 0.15 0.15 0.1 0.1 0.05 0.05 0.05 0.05 00 10 10 20 20 becomes brittle when the posterior is multimodal 0.15 0.15 0.1 0.1 00 -10 -10 log logjoint joint approximation approximation 00 -10 -10 00 10 10 20 20 only directly applicable to real-valued parameters Source: Kay H. Brodersen, 2013, http://people.inf.ethz.ch/bkay/talks/Brodersen_2013_03_22.pdf Variational Bayesian inference Variational Bayesian (VB) inference generalizes the idea behind the Laplace approximation. In VB, we wish to find an approximate density that is maximally similar to the true posterior. true posterior hypothesis class divergence KL || best proxy Source: Kay H. Brodersen, 2013, http://people.inf.ethz.ch/bkay/talks/Brodersen_2013_03_22.pdf Variational calculus Variational Bayesian inference is based on variational calculus. Standard calculus Variational calculus Newton, Leibniz, and others Euler, Lagrange, and others • functions : ↦ • functionals : ↦ d • derivatives d d • derivatives d Example: maximize the likelihood expression w.r.t. Example: maximize the entropy w.r.t. a probability distribution Leonhard Euler (1707 – 1783) Swiss mathematician, ‘Elementa Calculi Variationum’ Source: Kay H. Brodersen, 2013, http://people.inf.ethz.ch/bkay/talks/Brodersen_2013_03_22.pdf Variational calculus and the free energy Variational calculus lends itself nicely to approximate Bayesian inference. (,) ln () = ln =∫ ln ln = ∫ ln d (,) = ∫ ln = ∫ (,) + d (,) ln d d + ∫ ln (,) d KL[||] (, ) divergence between and free energy Source: Kay H. Brodersen, 2013, http://people.inf.ethz.ch/bkay/talks/Brodersen_2013_03_22.pdf Variational calculus and the free energy In summary, the log model evidence can be expressed as: ln ln KL[| , ln () = KL[| + , KL[| divergence free energy (easy to evaluate ≥ 0 (unknown) ∗ for a given ) , Maximizing , is equivalent to: • minimizing KL[| • tightening , as a lower bound to the log model evidence initialization … … convergence Source: Kay H. Brodersen, 2013, http://people.inf.ethz.ch/bkay/talks/Brodersen_2013_03_22.pdf Computing the free energy We can decompose the free energy (, ) as follows: (, ) = ∫ , ln d = ∫ ln , − ∫ ln = ln , expected log-joint + Shannon entropy Source: Kay H. Brodersen, 2013, http://people.inf.ethz.ch/bkay/talks/Brodersen_2013_03_22.pdf The mean-field assumption 1 When inverting models with several parameters, a common way of restricting the class of approximate posteriors is to consider those posteriors that factorize into independent partitions, = 2 , where is the approximate posterior for the th subset of parameters. 2 1 Jean Daunizeau, www.fil.ion.ucl.ac.uk/ ~jdaunize/presentations/Bayes2.pdf Source: Kay H. Brodersen, 2013, http://people.inf.ethz.ch/bkay/talks/Brodersen_2013_03_22.pdf Variational algorithm under the mean-field assumption In summary: , = −KL ||exp ln , \ + Suppose the densities \ ≡ \ are kept fixed. Then the approximate posterior that maximizes , is given by: ∗ This implies a straightforward algorithm for variational inference: Initialize all approximate posteriors , e.g., by setting them to their priors. = arg max , = 1 exp ln , Therefore: ln ∗ = ln , \ \ − ln Cycle over the parameters, revising each given the current estimates of the others. Loop until convergence. =: Source: Kay H. Brodersen, 2013, http://people.inf.ethz.ch/bkay/talks/Brodersen_2013_03_22.pdf Typical strategies in variational inference no parametric assumptions parametric assumptions = no mean-field assumption (variational inference = exact inference) fixed-form optimization of moments mean-field assumption = ∏ iterative free-form variational optimization iterative fixed-form variational optimization Source: Kay H. Brodersen, 2013, http://people.inf.ethz.ch/bkay/talks/Brodersen_2013_03_22.pdf Example: variational density estimation We are given a univariate dataset 1 , … , which we model by a simple univariate Gaussian distribution. We wish to infer on its mean and precision: , Although in this case a closed-form solution exists*, we shall pretend it does not. Instead, we consider approximations that satisfy the meanfield assumption: mean precision = 0, 0 = Ga 0, 0 data = 1… , = , −1 , = Source: Kay H. Brodersen, 2013, http://people.inf.ethz.ch/bkay/talks/Brodersen_2013_03_22.pdf 10.1.3; Bishop (2006) PRML −1 Recurring expressions in Bayesian inference Univariate normal distribution ln , −1 1 2 1 2 = ln − ln − 1 2 2 − 2 = − 2 + + Multivariate normal distribution ln , Λ−1 1 2 1 − Λ 2 2 Λ = − ln Λ−1 − ln 2 − = + 1 2 − Λ − + Gamma distribution ln Ga , = ln − ln Γ + − 1 ln − = − 1 ln − + Source: Kay H. Brodersen, 2013, http://people.inf.ethz.ch/bkay/talks/Brodersen_2013_03_22.pdf Variational density estimation: mean ln ∗ = ln , , = ln + , + ln = ln∏ ,−1 = = =− − − 2 + ln 0, 0 2 + ln −1 0 + − − 0 2 2 + + lnGa 0,0 + + 2 2 0 2 0 2 − + − − + 00 − 0 + 2 2 2 2 1 2 + 0 ⟹ ∗ = , −1 with 2 + + 00 = 0 + = + reinstation by inspection + 0 0 0 0 + = 0 + Source: Kay H. Brodersen, 2013, http://people.inf.ethz.ch/bkay/talks/Brodersen_2013_03_22.pdf Variational density estimation: precision ln ∗ = ln , , + , −1 = ln =1 = =1 + ln 0 , 0 + ln Ga 0 , 0 + 1 ln − − 2 2 + 0 − 1 ln − 0 = ln − ∑ − 2 2 1 = + + 0 − 1 2 2 −1 2 ⟹ ∗ = Ga , 2 1 0 + ln 0 − − 0 2 2 2 + 1 1 0 + ln + ln − − 0 2 + 0 − 1 ln − 0 + 0 2 2 2 1 0 ln − ∑ − 2 + − 0 2 + 0 + 2 2 with +1 = 0 + 2 0 = 0 + − 0 2 2 1 + ∑ − 2 2 Source: Kay H. Brodersen, 2013, http://people.inf.ethz.ch/bkay/talks/Brodersen_2013_03_22.pdf Variational density estimation: illustration ∗ Source: Kay H. Brodersen, 2013, http://people.inf.ethz.ch/bkay/talks/Brodersen_2013_03_22.pdf Bishop (2006) PRML, p. 472 Bayesian Inference Model Selection Approximate Inference Variational Bayes MCMC Sampling Markov Chain Monte Carlo(MCMC) sampling Posterior distribution Proposal distribution 0 1 q() 2 3 …………………... • A general framework for sampling from a large class of distributions • Scales well with dimensionality of sample space • Asymptotically convergent N Markov chain properties • Transition probabilities – homogeneous/invariant +1 1, … , ) = +1 ) = ( +1 , ) • Invariance Ergodic ∗ = Reversible Invariant ′ , ∗ ′ ′ • Ergodicity ∗ = lim →∞ Homogeneous ∀ ( 0 ) MCMC Metropolis-Hastings Algorithm • Initialize at step 1 - for example, sample from prior • At step t, sample from the proposal distribution: ∗ ~ ∗ t) • Accept with probability: ( ∗ , ) ~ 1, • Metropolis – Symmetric proposal distribution ∗ ) ( ∗ , ) ~ 1, ) Bishop (2006) PRML, p. 539 Gibbs Sampling Algorithm = (1 , 2 , … , ) • Special case of Metropolis Hastings • At step t, sample from the conditional distribution: 11 ~ 1 2, … … , 21 ~ (2|11, … … , ) : : : 1 ~ (3|11, … … , −1 1) • Acceptance probability = 1 • Blocked Sampling 11, 21 ~ 1, 2 3, … … , Posterior analysis from MCMC 0 1 2 3 …………………... Obtain independent samples: Generate samples based on MCMC sampling. Discard initial “burn-in” period samples to remove dependence on initialization. Thinning- select every mth sample to reduce correlation . Inspect sample statistics (e.g., histogram, sample quantiles, …) N Summary Bayesian Inference Model Selection Approximate Inference Variational Bayes MCMC Sampling References • • • • • • • • • • Pattern Recognition and Machine Learning (2006). Christopher M. Bishop Bayesian reasoning and machine learning. David Barber Information theory, pattern recognition and learning algorithms. David McKay Conjugate Bayesian analysis of the Gaussian distribution. Kevin P. Murphy. Videolectures.net – Bayesian inference, MCMC , Variational Bayes Akaike, H (1974). A new look at statistical model identification. IEEE transactions on Automatic Control 19, 716-723. Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics 6, 461-464. Kass, R.E. and Raftery, A.E. (1995).Bayes factors Journal of the American Statistical Association, 90, 773-795. Markov Chain Monte Carlo: Stochastic Simulation for Bayesian Inference, Second Edition (Chapman & Hall/CRC Texts in Statistical Science) Dani Gamerman, Hedibert F. Lopes. Statistical Parametric Mapping: The Analysis of Functional Brain Images. Thank You http://www.translationalneuromodeling.org/tapas/