### Simple Bayesian Supervised Models

```Simple Bayesian Supervised Models
Content


Recap from last week
Bayesian Linear Regression
  What is linear regression?
  Application of Bayesian theory to linear regression
  Example
  Comparison to conventional linear regression
Bayesian Logistic Regression
Naive Bayes classifier

Source: Bishop (ch. 3, 4); Barber (ch. 10)
Maximum a posteriori (MAP) estimation
• The Bayesian approach to estimating the parameters of a distribution from a set of observations $X$ is to maximize the posterior distribution:
  posterior = likelihood × prior / evidence
  $p(\eta \mid X) = \frac{p(X \mid \eta)\, p(\eta)}{p(X)}$
• This makes it possible to account for prior information.
Conjugate prior
• In general, for a given probability distribution p(x|η), we can seek a
prior p(η) that is conjugate to the likelihood function, so that the
posterior distribution has the same functional form as the prior.
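• For any member of the exponential family, $p(x \mid \eta) = h(x)\, g(\eta) \exp\{\eta^T u(x)\}$, there exists a conjugate
prior that can be written in the form
  $p(\eta \mid \chi, \nu) = f(\chi, \nu)\, g(\eta)^{\nu} \exp\{\nu\, \eta^T \chi\}$
  where $f(\chi, \nu)$ is a normalization coefficient.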
• Important conjugate pairs include:
Binomial – Beta
Multinomial – Dirichlet
Gaussian – Gaussian (for mean)
Gaussian – Gamma (for precision)
Exponential – Gamma
Linear Regression


goal: predict the value of a target variable $t$ given the value of a $D$-dimensional vector $\mathbf{x}$ of input variables
→ linear regression models: linear functions of the adjustable parameters

for example: $y = w_1 \cdot x_1 + w_2 \cdot x_2 - w_3 \cdot x_3 + w_4 \cdot x_4$, with numeric values in place of the weights $w_i$

Linear Regression

Training
$\{\mathbf{x}_n\}$ … training data set comprising $N$ observations, where $n = 1, \dots, N$
$\{t_n\}$ … corresponding target values
compute the weights $\mathbf{w}$

Prediction
goal: predict the value of $t$ for a new value of $\mathbf{x}$
= model the predictive distribution $p(t \mid \mathbf{x})$ and make predictions of $t$ in such a way as to minimize the expected value of a loss function
Examples of linear regression models

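simplest linear regression model:
$y(\mathbf{x}, \mathbf{w}) = w_0 + w_1 x_1 + \dots + w_{M-1} x_{M-1} = \sum_{j=0}^{M-1} w_j x_j = \mathbf{w}^T \mathbf{x}$, with $x_0 = 1$

linear function of the weights/parameters $\mathbf{w}$ and the data $\mathbf{x}$

linear regression models using basis functions $\phi_j$:
$y(\mathbf{x}, \mathbf{w}) = \sum_{j=0}^{M-1} w_j \phi_j(\mathbf{x}) = \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x})$
$\mathbf{w} = (w_0, \dots, w_{M-1})^T$
$\boldsymbol{\phi} = (\phi_0, \dots, \phi_{M-1})^T$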
Bayesian Linear Regression

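model: $t = y(\mathbf{x}, \mathbf{w}) + \epsilon$

$t$ … target variable
$y$ … model
$\mathbf{x}$ … data
$\mathbf{w}$ … weights/parameters
$\epsilon$ … additive Gaussian noise: $p(\epsilon) = \mathcal{N}(0, \beta^{-1})$, with zero mean and precision (inverse variance) $\beta$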
Bayesian Linear Regression - Likelihood

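likelihood function for a single observation:
$p(t \mid \mathbf{x}, \mathbf{w}, \beta) = \mathcal{N}(t \mid y(\mathbf{x}, \mathbf{w}), \beta^{-1})$

observation of $N$ training pairs with inputs $\mathbf{X} = \{\mathbf{x}_1, \dots, \mathbf{x}_N\}$ and target values $\mathbf{t} = \{t_1, \dots, t_N\}$ (independently drawn from the distribution):
$p(\mathbf{t} \mid \mathbf{X}, \mathbf{w}, \beta) = \prod_{n=1}^{N} \mathcal{N}(t_n \mid \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_n), \beta^{-1})$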
Bayesian Linear Regression - Prior



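prior probability distribution over the model parameters $\mathbf{w}$
conjugate prior: Gaussian distribution
$p(\mathbf{w}) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}_0, \mathbf{S}_0)$
with mean $\mathbf{m}_0$ and covariance $\mathbf{S}_0$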
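Bayesian Linear Regression - Posterior Distribution

due to the conjugate prior, the posterior will also be Gaussian:
$p(\mathbf{w} \mid \mathbf{t}) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}_N, \mathbf{S}_N)$
$\mathbf{m}_N = \mathbf{S}_N (\mathbf{S}_0^{-1} \mathbf{m}_0 + \beta \boldsymbol{\Phi}^T \mathbf{t})$
$\mathbf{S}_N^{-1} = \mathbf{S}_0^{-1} + \beta \boldsymbol{\Phi}^T \boldsymbol{\Phi}$
where $\boldsymbol{\Phi}$ is the design matrix with elements $\Phi_{nj} = \phi_j(\mathbf{x}_n)$
(derivation: Bishop p.112)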
Example Linear Regression

MATLAB
Predictive Distribution







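making predictions of $t$ for new values of $\mathbf{x}$
predictive distribution:
$p(t \mid \mathbf{x}, \mathbf{t}, \alpha, \beta) = \mathcal{N}(t \mid \mathbf{m}_N^T \boldsymbol{\phi}(\mathbf{x}), \sigma_N^2(\mathbf{x}))$
($\alpha$ is the precision of the Gaussian prior over $\mathbf{w}$, $\beta$ the noise precision)
variance of the distribution:
$\sigma_N^2(\mathbf{x}) = \frac{1}{\beta} + \boldsymbol{\phi}(\mathbf{x})^T \mathbf{S}_N \boldsymbol{\phi}(\mathbf{x})$
first term represents the noise in the data
second term reflects the uncertainty associated with the parameters $\mathbf{w}$
optimal prediction, for a new value of $\mathbf{x}$, would be the conditional mean of the target variable:
$\mathbb{E}[t \mid \mathbf{x}] = \int t \, p(t \mid \mathbf{x}) \, dt = y(\mathbf{x}, \mathbf{w})$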
Common Problem in Linear Regression: Overfitting/model complexity

Least Squares approach (maximizing the likelihood):
• point estimate of the weights
• Regularization: the regularization term and its coefficient need to be chosen
• Cross-Validation: requires large datasets and high computational power
Bayesian approach:
• distribution over the weights
• requires a good prior
• model comparison: computationally demanding, but validation data not required
From Regression to Classification

for regression problems:
• the target variable $\mathbf{t}$ was the vector of real numbers whose values we wish to predict
in case of classification:
• target values represent class labels
• two-class problem: $t \in \{1, 0\}$
• $K > 2$: $\mathbf{t} = (0, 1, 0, 0, 0)^T$ → class 2
Classification

goal: take an input vector $\mathbf{x}$ and assign it to one of $K$ discrete classes $C_k$
decision boundary
Bayesian Logistic Regression

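model the class-conditional densities $p(\mathbf{x} \mid C_k)$ and the prior probabilities $p(C_k)$ and apply Bayes' theorem:
$p(C_k \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid C_k)\, p(C_k)}{p(\mathbf{x})}$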
Bayesian Logistic Regression

exact Bayesian inference for logistic regression is intractable
→ Laplace approximation
• aims to find a Gaussian approximation to a probability density defined over a set of continuous variables
• the posterior distribution is approximated by a Gaussian centred on its mode $\mathbf{w}_{MAP}$
Example

Barber: DemosExercises\demoBayesLogRegression.m
Naive Bayes classifier

Why naive?
• strong independence assumptions
• assumes that the presence/absence of a feature of a class is unrelated to the presence/absence of any other feature, given the class variable
• ignores relations between features and assumes that all features contribute independently to the probability of a class
[http://en.wikipedia.org/wiki/Naive_Bayes_classifier]