### Bayesian Inference - Translational Neuromodeling Unit

```The Reverend Thomas Bayes
(1702-1761)
Bayesian Inference
Sudhir Shankar Raman
Translational Neuromodeling Unit,
UZH & ETH
With many thanks for materials to:
Klaas Enno Stephan & Kay H. Brodersen
Why do I need to learn about Bayesian stats?
Because SPM is getting more and more Bayesian:
• Segmentation & spatial normalisation
• Posterior probability maps (PPMs)
• Dynamic Causal Modelling (DCM)
• Bayesian Model Selection (BMS)
• EEG: source reconstruction
Bayesian Inference
Model Selection
Approximate Inference
Variational
Bayes
MCMC
Sampling
Bayesian Inference
Model Selection
Approximate Inference
Variational
Bayes
MCMC
Sampling
Classical and Bayesian statistics
p-value: probability of getting the observed data in the effect’s absence. If
small, reject null hypothesis that there is no effect.
H0 :   0
 One can never accept the null
hypothesis
p( y | H 0 )
 Given enough data, one can always
demonstrate a significant effect
Probability of observing the data
y, given no effect ( = 0).
 Correction for multiple comparisons
necessary
Bayesian Inference
 Flexibility in modelling
p( y, )
 Incorporating prior information
p( )
 Posterior probability of effect
p( | y )
 Options for model comparison
Statistical analysis and the illusion of objectivity.
James O. Berger, Donald A. Berry
Bayes‘ Theorem
Posterior
Beliefs
Observed
Data
Reverend Thomas Bayes
1702 - 1761
Prior
Beliefs
“Bayes‘ Theorem describes, how an ideally rational person
processes information."
Wikipedia
Bayes’ Theorem
Given data y and parameters , the conditional probabilities are:
p( y, )
p( | y ) 
p( y)
p( y ,  )
p( y |  ) 
p( )
Eliminating p(y,) gives Bayes’ rule:
Likelihood
Posterior
p( y |  ) p( )
P( | y ) 
p( y)
Evidence
Prior
Bayesian statistics
new data
p( y |  )
prior knowledge
p( )
p( | y)  p( y |  ) p( )
posterior
 likelihood ∙ prior
Bayes theorem allows one to
formally incorporate prior
knowledge into computing
statistical probabilities.
Priors can be of different sorts:
empirical, principled or shrinkage
priors, uninformative.
The “posterior” probability of the
parameters given the data is an
optimal combination of prior
knowledge and new data, weighted by
their relative precision.
Bayes in motion - an animation
Principles of Bayesian inference
 Formulation of a generative model
Likelihood function p(y|)
Model
prior distribution p()
 Observation of data
Data
y
 Model Inversion - Update of beliefs based upon
observations, given a prior state of knowledge
p( | y)  p( y |  ) p( )
Maximum a posteriori
(MAP)
Maximum likelihood
(ML)
Conjugate Priors
 Prior and Posterior have the same form
p( y |  ) p( )
p( | y ) 
p( y)
Same form !!
 Analytical expression.
 Conjugate priors for all exponential family members.
 Example – Gaussian Likelihood , Gaussian prior over mean
Gaussian Model
Likelihood & prior
p( y |  )  N ( y |  , e 1 )
y  
p( )  N ( |  p , p1 )

Posterior: p( | y )  N ( |  ,  )
1
  e   p
e  p
  y  p


Relative precision weighting
Posterior
y
Likelihood
p
Prior
Bayesian regression: univariate case
Normal densities
p( )  N ( |  p ,  p2 )
y  x  
Univariate
linear model
 | y
p( y |  )  N ( y | x ,  e2 )
y
x
p( | y )  N ( |  | y ,  2| y )
p
1
 2| y
 | y
x2 1
 2 2
e  p
 x

1
   2 y  2 p 

 p 
 e
2
|y
Relative precision weighting
Bayesian GLM: multivariate case
Normal densities
General
Linear
Model
p(θ)  N (θ; η p , C p )
y  Xθ  e
p(θ | y)  N (θ; η | y , C | y )
1
1
C | y  XT Ce X  C p

2
p(y | θ)  N (y; Xθ, Ce )
1
1
η | y  C | y XT Ce y  C p η p

 One step if Ce is known.
 Otherwise define conjugate prior or
perform iterative estimation with EM.
1
An intuitive example
10
2
5
0
-5
Prior
Likelihood
Posterior
-10
-10
-5
0
1
5
10
Bayesian Inference
Model Selection
Approximate Inference
Variational
Bayes
MCMC
Sampling
Bayesian model selection (BMS)
Given competing hypotheses
on structure & functional
mechanisms of a system,
which model is the best?
Which model represents the
best balance between model
fit and model complexity?
For which model m does
p(y|m) become maximal?
Pitt & Miyung (2002), TICS
Bayesian model selection (BMS)
Bayes’ rule: p( | y, m) 
p( y |  , m) p( | m)
p ( y | m)

Model evidence: p( y | m)  p( y |  , m)  p( | m) d
accounts for both accuracy and complexity of the model
allows for inference about structure (generalizability) of the model
Model comparison via Bayes factor:
1 ) (|1 ) (1 )
=
(2 |) (|2 ) (2 )
Model averaging
| =
,  (|)

Kass and Raftery (1995), Penny et al. (2004) NeuroImage
=
(|1 )
(|2 )
BF10
Evidence against H0
1 to 3
Not worth more than a
bare mention
3 to 20
Positive
20 to 150
Strong
> 150
Decisive
Bayesian model selection (BMS)
Various Approximations:
• Akaike Information Criterion (AIC) – Akaike, 1974
ln    ) −
• Bayesian Information Criterion (BIC) – Schwarz, 1978
1
ln () ≅ ln    ) −  ln()
2
• Negative free energy ( F )
• A by-product of Variational Bayes
Bayesian Inference
Model Selection
Approximate Inference
Variational
Bayes
MCMC
Sampling
Approximate Bayesian inference
Bayesian inference formalizes model inversion, the process of
passing from a prior to a posterior in light of data.
posterior
=
likelihood
prior

∫  ,  d
marginal likelihood
(model evidence)
In practice, evaluating the posterior is usually difficult because we cannot
easily evaluate   , especially when:
• High dimensionality, complex form
• analytical solutions are not available
• numerical integration is too expensive
Source: Kay H. Brodersen, 2013, http://people.inf.ethz.ch/bkay/talks/Brodersen_2013_03_22.pdf
Approximate Bayesian inference
There are two approaches to approximate inference. They have
complementary strengths and weaknesses.
Deterministic
approximate inference
Stochastic
approximate inference
in particular variational Bayes
in particular sampling
 find an analytical proxy   that is
maximally similar to
 design an algorithm that draws
samples  1 , … ,   from

 inspect distribution statistics of
(e.g., mean, quantiles,
intervals, …)
 often insightful and fast
 often hard work to derive
 converges to local minima
 inspect sample statistics (e.g.,
histogram, sample quantiles, …)
 asymptotically exact
 computationally expensive
 tricky engineering concerns
Source: Kay H. Brodersen, 2013, http://people.inf.ethz.ch/bkay/talks/Brodersen_2013_03_22.pdf
Bayesian Inference
Model Selection
Approximate Inference
Variational
Bayes
MCMC
Sampling
The Laplace approximation
The Laplace approximation provides a way of approximating
a density whose normalization constant we cannot
evaluate, by fitting a Normal distribution to its mode.

=
1

Pierre-Simon
Laplace
(1749 – 1827)
main part of the density
(easy to evaluate)
French mathematician
and astronomer
×
normalization constant
(unknown)
This is exactly the situation we face in Bayesian inference:

=
1

×
model evidence
(unknown)
,
joint density
(easy to evaluate)
Source: Kay H. Brodersen, 2013, http://people.inf.ethz.ch/bkay/talks/Brodersen_2013_03_22.pdf
Applying the Laplace approximation
Given a model with parameters  = 1 , … ,  , the Laplace
approximation reduces to a simple three-step procedure:

Find the mode of the log-joint:
∗ = arg max ln  ,


Evaluate the curvature of the log-joint at the mode:

We obtain a Gaussian approximation:
, Λ−1
with  =  ∗
Λ = −∇∇ ln  ,  ∗
∇∇ ln  ,  ∗
Source: Kay H. Brodersen, 2013, http://people.inf.ethz.ch/bkay/talks/Brodersen_2013_03_22.pdf
Limitations of the Laplace approximation
The Laplace approximation is often too strong a simplification.
The
TheLaplace
Laplaceapproximation
approximation
The
TheLaplace
Laplaceapproximation
approximation
0.25
0.25
0.25
0.25
0.2
0.2
0.2
0.2
ignores global
properties of the
posterior
0.15
0.15
0.1
0.1
0.05
0.05
0.05
0.05
00
10
10
20
20
becomes brittle
when the posterior
is multimodal
0.15
0.15
0.1
0.1
00
-10
-10
log
logjoint
joint
approximation
approximation
00
-10
-10
00
10
10
20
20
only directly applicable to
real-valued parameters
Source: Kay H. Brodersen, 2013, http://people.inf.ethz.ch/bkay/talks/Brodersen_2013_03_22.pdf
Variational Bayesian inference
Variational Bayesian (VB) inference generalizes the idea behind the
Laplace approximation. In VB, we wish to find an approximate density
that is maximally similar to the true posterior.
true
posterior
hypothesis
class

divergence
KL ||
best proxy

Source: Kay H. Brodersen, 2013, http://people.inf.ethz.ch/bkay/talks/Brodersen_2013_03_22.pdf
Variational calculus
Variational Bayesian inference is based on variational calculus.
Standard calculus
Variational calculus
Newton, Leibniz, and
others
Euler, Lagrange, and
others
• functions
:  ↦
• functionals
:  ↦
d
• derivatives
d
d
• derivatives
d
Example: maximize the
likelihood expression
w.r.t.
Example: maximize the
entropy   w.r.t. a
probability distribution

Leonhard Euler
(1707 – 1783)
Swiss mathematician,
‘Elementa Calculi
Variationum’
Source: Kay H. Brodersen, 2013, http://people.inf.ethz.ch/bkay/talks/Brodersen_2013_03_22.pdf
Variational calculus and the free energy
Variational calculus lends itself nicely to approximate Bayesian
inference.
(,)
ln () = ln

=∫
ln
ln
= ∫   ln
d
(,)

= ∫   ln
= ∫
(,)

+

d
(,)
ln

d
d + ∫   ln
(,)
d

KL[||]
(, )
divergence between
and
free energy
Source: Kay H. Brodersen, 2013, http://people.inf.ethz.ch/bkay/talks/Brodersen_2013_03_22.pdf
Variational calculus and the free energy
In summary, the log model
evidence can be expressed as:
ln
ln
KL[|
,
ln () = KL[|  +  ,
KL[|
divergence free energy
(easy to evaluate
≥ 0
(unknown)
∗
for a given )
,
Maximizing  ,  is equivalent
to:
• minimizing KL[|
• tightening  ,  as a lower
bound to the log model
evidence
initialization
…
…
convergence
Source: Kay H. Brodersen, 2013, http://people.inf.ethz.ch/bkay/talks/Brodersen_2013_03_22.pdf
Computing the free energy
We can decompose the free energy (, ) as follows:
(, ) = ∫
,
ln

d
= ∫   ln  ,   − ∫   ln
= ln  ,
expected
log-joint

+
Shannon
entropy
Source: Kay H. Brodersen, 2013, http://people.inf.ethz.ch/bkay/talks/Brodersen_2013_03_22.pdf
The mean-field assumption
1
When inverting models with
several parameters, a common
way of restricting the class of
approximate posteriors   is to
consider those posteriors that
factorize into independent
partitions,
=
2
,

where   is the approximate
posterior for the  th subset of
parameters.
2
1
Jean Daunizeau, www.fil.ion.ucl.ac.uk/
~jdaunize/presentations/Bayes2.pdf
Source: Kay H. Brodersen, 2013, http://people.inf.ethz.ch/bkay/talks/Brodersen_2013_03_22.pdf
Variational algorithm under the mean-field assumption
In summary:
, = −KL ||exp ln ,
\
+
Suppose the densities \ ≡  \ are
kept fixed. Then the approximate
posterior   that maximizes  ,  is
given by:
∗
This implies a straightforward
algorithm for variational
inference:
 Initialize all approximate
posteriors   , e.g., by
setting them to their
priors.
= arg max  ,

=
1
exp

ln  ,
Therefore:
ln ∗ = ln  ,
\
\
− ln
 Cycle over the parameters,
revising each given the
current estimates of the
others.
 Loop until convergence.
=:
Source: Kay H. Brodersen, 2013, http://people.inf.ethz.ch/bkay/talks/Brodersen_2013_03_22.pdf
Typical strategies in variational inference
no parametric
assumptions
parametric
assumptions
=
no mean-field
assumption
(variational inference
= exact inference)
fixed-form
optimization of
moments
mean-field
assumption
= ∏
iterative free-form
variational
optimization
iterative fixed-form
variational
optimization
Source: Kay H. Brodersen, 2013, http://people.inf.ethz.ch/bkay/talks/Brodersen_2013_03_22.pdf
Example: variational density estimation
We are given a univariate dataset
1 , … ,  which we model by a simple
univariate Gaussian distribution. We
wish to infer on its mean and precision:
,
Although in this case a closed-form
solution exists*, we shall pretend it
approximations that satisfy the meanfield assumption:
mean precision

=   0, 0
= Ga  0, 0

data
= 1…
,  =   , −1
,  =
Source: Kay H. Brodersen, 2013, http://people.inf.ethz.ch/bkay/talks/Brodersen_2013_03_22.pdf
10.1.3; Bishop (2006) PRML
−1
Recurring expressions in Bayesian inference
Univariate normal distribution
ln   , −1
1
2
1
2
= ln  − ln  −
1
2

2
−
2
= −  2 +  +
Multivariate normal distribution
ln   , Λ−1
1
2
1
−   Λ
2

2
Λ
= − ln Λ−1 − ln 2 −
=
+
1
2
−  Λ  −
+
Gamma distribution
ln Ga  ,
=  ln  − ln Γ  +  − 1 ln  −
=  − 1 ln  −   +
Source: Kay H. Brodersen, 2013, http://people.inf.ethz.ch/bkay/talks/Brodersen_2013_03_22.pdf
Variational density estimation: mean
ln  ∗  = ln  , ,

= ln

+
,
+ ln

= ln∏  ,−1
=
=
=−

−  −
2

+ ln  0, 0
2

+ ln
−1
0
+ −
− 0
2

2

+
+ lnGa  0,0

+
+

2
2 0    2
0 2
−
+     −
−
+ 00    − 0 +
2
2
2
2
1

2

+ 0
⟹  ∗  =    , −1

with
2 +

+ 00
= 0 +
=

+
reinstation by
inspection

+ 0 0

0 0 +
=
0 +
Source: Kay H. Brodersen, 2013, http://people.inf.ethz.ch/bkay/talks/Brodersen_2013_03_22.pdf
Variational density estimation: precision
ln  ∗  = ln  , ,

+

,  −1
= ln
=1

=
=1
+ ln   0 , 0
+ ln Ga  0 , 0

+

1

ln  −  −
2
2
+ 0 − 1 ln  − 0

= ln − ∑  −
2
2
1
= + + 0 − 1
2 2
−1
2
⟹ ∗  = Ga   ,
2

1
0
+ ln 0  −
− 0
2
2
2

+
1
1
0
+
ln
+
ln
−
− 0 2   + 0 − 1 ln − 0 +

0
2
2
2
1
0
ln  −
∑  −  2   +
− 0 2   + 0  +
2
2
with
+1
= 0 +
2
0
= 0 +
− 0
2
2

1
+ ∑  −
2
2

Source: Kay H. Brodersen, 2013, http://people.inf.ethz.ch/bkay/talks/Brodersen_2013_03_22.pdf
Variational density estimation: illustration

∗
Source: Kay H. Brodersen, 2013, http://people.inf.ethz.ch/bkay/talks/Brodersen_2013_03_22.pdf
Bishop (2006) PRML, p. 472
Bayesian Inference
Model Selection
Approximate Inference
Variational
Bayes
MCMC
Sampling
Markov Chain Monte Carlo(MCMC) sampling
Posterior
distribution
Proposal
distribution

0
1
q()
2
3
…………………...
• A general framework for sampling from a large class of
distributions
• Scales well with dimensionality of sample space
• Asymptotically convergent
N
Markov chain properties
• Transition probabilities – homogeneous/invariant
+1 1, … , ) =   +1 ) =  ( +1 , )
• Invariance
Ergodic
∗  =
Reversible
Invariant
′ ,  ∗  ′
′
• Ergodicity
∗  = lim
→∞
Homogeneous
∀ ( 0 )
MCMC
Metropolis-Hastings Algorithm
• Initialize  at step 1 - for example, sample from prior
• At step t, sample from the proposal distribution:
∗ ~   ∗ t)
• Accept with probability:
( ∗ ,   ) ~  1,
• Metropolis – Symmetric proposal distribution
∗ )

( ∗ ,   ) ~  1,
)
Bishop (2006) PRML, p. 539
Gibbs Sampling Algorithm
= (1 , 2 , … ,  )
• Special case of Metropolis Hastings
• At step t, sample from the conditional distribution:
11 ~  1 2, … … ,
21 ~ (2|11, … … , )
:
:
:
1 ~ (3|11, … … , −1 1)
• Acceptance probability = 1
• Blocked Sampling
11, 21 ~
1, 2 3, … … ,
Posterior analysis from MCMC
0
1
2
3
…………………...
Obtain independent samples:

Generate samples based on MCMC sampling.

Discard initial “burn-in” period samples to remove dependence
on initialization.

Thinning- select every mth sample to reduce correlation .

Inspect sample statistics (e.g., histogram, sample quantiles, …)
N
Summary
Bayesian Inference
Model Selection
Approximate Inference
Variational
Bayes
MCMC
Sampling
References
•
•
•
•
•
•
•
•
•
•
Pattern Recognition and Machine Learning (2006). Christopher M. Bishop
Bayesian reasoning and machine learning. David Barber
Information theory, pattern recognition and learning algorithms. David McKay
Conjugate Bayesian analysis of the Gaussian distribution. Kevin P. Murphy.
Videolectures.net – Bayesian inference, MCMC , Variational Bayes
Akaike, H (1974). A new look at statistical model identification. IEEE
transactions on Automatic Control 19, 716-723.
Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics
6, 461-464.
Kass, R.E. and Raftery, A.E. (1995).Bayes factors Journal of the American
Statistical Association, 90, 773-795.
Markov Chain Monte Carlo: Stochastic Simulation for Bayesian Inference,
Second Edition (Chapman & Hall/CRC Texts in Statistical Science) Dani
Gamerman, Hedibert F. Lopes.
Statistical Parametric Mapping: The Analysis of Functional Brain Images.
Thank You
http://www.translationalneuromodeling.org/tapas/
```