Simple Bayesian Supervised Models

Saskia Klein & Steffen Bollmann
Content


Recap from last week
Bayesian Linear Regression
    What is linear regression?
    Application of Bayesian Theory to Linear Regression
    Example
    Comparison to Conventional Linear Regression
Bayesian Logistic Regression
Naive Bayes classifier

Sources: Bishop (ch. 3, 4); Barber (ch. 10)
Maximum a posteriori estimation

• The Bayesian approach to estimating the parameters of a distribution from a set of observations is to maximize the posterior distribution:
    posterior = likelihood × prior / evidence,   i.e.   p(θ | X) = p(X | θ) p(θ) / p(X)
• This makes it possible to take prior information into account.
Conjugate prior

• In general, for a given probability distribution p(x | η), we can seek a prior p(η) that is conjugate to the likelihood function, so that the posterior distribution has the same functional form as the prior.
• For any member of the exponential family there exists a conjugate prior, which can be written in the form
    p(η | χ, ν) = f(χ, ν) g(η)^ν exp( ν η^T χ )
• Important conjugate pairs include:
    Binomial – Beta (see the sketch below)
    Multinomial – Dirichlet
    Gaussian – Gaussian (for the mean)
    Gaussian – Gamma (for the precision)
    Exponential – Gamma
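As a concrete illustration of the Binomial – Beta pair, here is a minimal Matlab sketch of conjugacy in action; the prior pseudo-counts and the observed data are made-up values, not part of the slides:

    % Beta prior + binomial likelihood -> Beta posterior of the same functional form
    a0 = 2; b0 = 2;                        % prior Beta(a0, b0), assumed pseudo-counts
    N = 10; k = 7;                         % made-up observation: k successes in N trials
    aN = a0 + k;                           % conjugate update: posterior is Beta(aN, bN)
    bN = b0 + (N - k);
    mu = linspace(0, 1, 200);
    prior     = mu.^(a0-1) .* (1-mu).^(b0-1) / beta(a0, b0);   % Beta pdf (base Matlab beta function)
    posterior = mu.^(aN-1) .* (1-mu).^(bN-1) / beta(aN, bN);
    plot(mu, prior, '--', mu, posterior, '-');
    xlabel('\mu'); legend('prior', 'posterior');

Because prior and posterior share the same functional form, the update reduces to adding the observed counts to the prior pseudo-counts.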
Linear Regression


goal: predict the value of a target variable t given the value of a D-dimensional vector x of input variables
→ linear regression models: linear functions of the adjustable parameters w

for example, a weighted sum of four input variables with fixed numeric weights:
    t = w_1 ⋅ x_1 + w_2 ⋅ x_2 + w_3 ⋅ x_3 + w_4 ⋅ x_4

Linear Regression

Training
    {x_n} … training data set comprising N observations, where n = 1, …, N
    {t_n} … corresponding target values
    → compute the weights w

Prediction
    goal: predict the value of t for a new value of x
    = model the predictive distribution p(t | x) and make predictions of t in such a way as to minimize the expected value of a loss function
Examples of linear regression models

simplest linear regression model:
    y(x, w) = w_0 + w_1 x_1 + … + w_D x_D
    → a linear function of the weights/parameters w and of the data x

linear regression models using basis functions φ_j:
    y(x, w) = Σ_{j=0}^{M−1} w_j φ_j(x) = w^T φ(x)
    with w = (w_0, …, w_{M−1})^T and φ = (φ_0, …, φ_{M−1})^T
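A minimal Matlab sketch of how the design matrix Φ used later can be built from such basis functions; the choice of Gaussian basis functions, their centres, and their width are illustrative assumptions, not taken from the slides:

    % Design matrix Phi with Phi(n,j) = phi_j(x_n) for N inputs and M basis functions
    x = linspace(0, 1, 25)';             % N = 25 scalar inputs (made-up data)
    M = 6;                               % number of basis functions (assumed)
    c = linspace(0, 1, M);               % basis-function centres (assumed)
    s = 0.2;                             % common width (assumed)
    Phi = exp(-(x - c).^2 / (2*s^2));    % Gaussian basis functions (uses implicit expansion)
    Phi(:, 1) = 1;                       % first basis function as a constant bias, phi_0(x) = 1

Each row of Phi is the feature vector φ(x_n)^T, so the model prediction for all inputs is simply Phi * w.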
Bayesian Linear Regression

model:   t = y(x, w) + ε

    t … target variable
    y … model
    x … data
    w … weights/parameters
    ε … additive Gaussian noise, p(ε) = N( ε | 0, β^{-1} ), with zero mean and precision (inverse variance) β
Bayesian Linear Regression – Likelihood

likelihood function:
    p(t | x, w, β) = N( t | y(x, w), β^{-1} )

for N training data points with inputs X = {x_1, …, x_N} and target values t = {t_1, …, t_N}, drawn independently from the distribution:
    p(t | X, w, β) = ∏_{n=1}^{N} N( t_n | w^T φ(x_n), β^{-1} )
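Taking the logarithm turns this product into a sum, which is the quantity actually maximized in maximum-likelihood / least-squares fitting (Bishop, eqs. 3.11–3.12):

    ln p(t | w, β) = Σ_{n=1}^{N} ln N( t_n | w^T φ(x_n), β^{-1} )
                   = (N/2) ln β − (N/2) ln(2π) − β E_D(w)
    E_D(w) = (1/2) Σ_{n=1}^{N} ( t_n − w^T φ(x_n) )^2

Maximizing the log-likelihood with respect to w is therefore equivalent to minimizing the sum-of-squares error E_D(w).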
Bayesian Linear Regression – Prior

prior probability distribution over the model parameters w
conjugate prior: a Gaussian distribution
    p(w) = N( w | m_0, S_0 )
with mean m_0 and covariance S_0
Bayesian Linear Regression – Posterior Distribution

due to the conjugate prior, the posterior will also be Gaussian:
    p(w | t) = N( w | m_N, S_N )
    m_N = S_N ( S_0^{-1} m_0 + β Φ^T t )
    S_N^{-1} = S_0^{-1} + β Φ^T Φ
where Φ is the design matrix with elements Φ_nj = φ_j(x_n)
because the posterior is Gaussian, its mode coincides with its mean, so w_MAP = m_N
(derivation: Bishop p. 112)
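A minimal Matlab sketch of this posterior update, assuming a zero-mean isotropic prior m_0 = 0, S_0 = α^{-1} I, Gaussian basis functions, and made-up 1-D data (all of these choices are illustrative assumptions):

    % Posterior over the weights: p(w | t) = N(w | mN, SN)
    rng(0);
    N = 25; beta = 25; alpha = 2;                  % noise precision and prior precision (assumed)
    x = linspace(0, 1, N)';
    t = sin(2*pi*x) + sqrt(1/beta)*randn(N, 1);    % synthetic targets (made-up data)
    M = 6; c = linspace(0, 1, M); s = 0.2;
    Phi = exp(-(x - c).^2 / (2*s^2));              % N x M design matrix of Gaussian basis functions
    Phi(:, 1) = 1;                                 % bias basis function phi_0(x) = 1
    SNinv = alpha*eye(M) + beta*(Phi'*Phi);        % S_N^{-1} = S_0^{-1} + beta * Phi' * Phi
    SN    = inv(SNinv);                            % posterior covariance
    mN    = SNinv \ (beta*(Phi'*t));               % posterior mean (m_0 = 0, so the S_0^{-1} m_0 term vanishes)

The posterior mean mN is also the MAP estimate of the weights, and letting alpha → 0 recovers the ordinary least-squares solution.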
Example Linear Regression

Matlab demo
Predictive Distribution

making predictions of t for new values of x

predictive distribution:
    p(t | x, t, α, β) = N( t | m_N^T φ(x), σ_N^2(x) )
variance of the distribution:
    σ_N^2(x) = 1/β + φ(x)^T S_N φ(x)
    the first term represents the noise in the data
    the second term reflects the uncertainty associated with the parameters w

the optimal prediction, for a new value of x, is the conditional mean of the target variable:
    E[t | x] = ∫ t p(t | x) dt = y(x, w)
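The predictive mean and variance can be evaluated pointwise on a grid of new inputs; this Matlab sketch re-creates the same assumed setup as the posterior sketch above so that it runs on its own:

    % Predictive distribution p(t | x) = N(t | mN'*phi(x), s2(x)) on new inputs
    rng(0);
    N = 25; beta = 25; alpha = 2;
    x = linspace(0, 1, N)';  t = sin(2*pi*x) + sqrt(1/beta)*randn(N, 1);
    M = 6; c = linspace(0, 1, M); s = 0.2;
    basis = @(z) [ones(size(z)) exp(-(z - c(2:end)).^2 / (2*s^2))];  % phi_0 = 1 plus Gaussian bases
    Phi = basis(x);
    SN  = inv(alpha*eye(M) + beta*(Phi'*Phi));
    mN  = SN * (beta*(Phi'*t));

    xs   = linspace(0, 1, 100)';                   % new inputs
    Phis = basis(xs);
    m    = Phis * mN;                              % predictive mean  mN' * phi(x)
    s2   = 1/beta + sum((Phis*SN).*Phis, 2);       % noise term 1/beta + weight-uncertainty term phi' SN phi
    plot(x, t, 'o', xs, m, '-', xs, m + sqrt(s2), '--', xs, m - sqrt(s2), '--');

The two terms of s2 reproduce the decomposition above: a constant noise floor 1/β plus an input-dependent term that shrinks as more data constrain the weights.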
Common Problem in Linear Regression: Overfitting / Model Complexity

Least-squares approach (maximizing the likelihood):
    point estimate of the weights
    regularization: the regularization term and its weighting need to be chosen
    cross-validation: requires large datasets and high computational power

Bayesian approach:
    distribution over the weights
    requires a good prior
    model comparison: computationally demanding, but validation data not required
From Regression to Classification

for regression problems:
    the target variable t was a vector of real numbers whose values we wish to predict
in the case of classification:
    target values represent class labels
    two-class problem: t ∈ {1, 0}
    K > 2 classes: 1-of-K coding, e.g. t = (0, 1, 0, 0, 0)^T → class 2
Classification

goal: take an input vector x and assign it to one of K discrete classes C_k
the input space is divided into decision regions whose boundaries are called decision boundaries
Bayesian Logistic Regression

model the class-conditional densities p(x | C_k) and the prior class probabilities p(C_k), and apply Bayes' theorem:
    p(C_k | x) = p(x | C_k) p(C_k) / p(x)
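For two classes, this posterior can be rewritten (Bishop, ch. 4) as a logistic sigmoid acting on the log-odds; this is the step that motivates logistic regression, which then models a directly as a linear function a = w^T φ(x):

    p(C_1 | x) = p(x | C_1) p(C_1) / ( p(x | C_1) p(C_1) + p(x | C_2) p(C_2) )
               = 1 / (1 + exp(−a)) = σ(a)
    where   a = ln [ p(x | C_1) p(C_1) / ( p(x | C_2) p(C_2) ) ]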
Bayesian Logistic Regression

exact Bayesian inference for logistic regression is intractable
→ Laplace approximation
    aims to find a Gaussian approximation to a probability density defined over a set of continuous variables
    the posterior distribution is approximated by a Gaussian centred around its mode w_MAP
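A minimal Matlab sketch of the Laplace approximation for logistic regression, using a Gaussian prior p(w) = N(0, α^{-1} I) and made-up two-class data; the prior precision, the data, and the number of Newton steps are illustrative assumptions:

    % Laplace approximation: p(w | t) ~ N(w | wMAP, SN), where SN^{-1} is the Hessian
    % of the negative log posterior evaluated at the mode wMAP.
    rng(0);
    N = 100;  alpha = 1;                           % prior precision (assumed)
    X = [randn(N/2, 2) - 1; randn(N/2, 2) + 1];    % two made-up Gaussian clusters
    t = [zeros(N/2, 1); ones(N/2, 1)];             % class labels
    Phi = [ones(N, 1) X];                          % design matrix with bias term
    M = size(Phi, 2);
    sigm = @(a) 1 ./ (1 + exp(-a));

    w = zeros(M, 1);
    for it = 1:20                                  % Newton-Raphson search for the mode wMAP
        y = sigm(Phi * w);
        g = Phi' * (y - t) + alpha * w;            % gradient of the negative log posterior
        H = Phi' * diag(y .* (1 - y)) * Phi + alpha * eye(M);   % Hessian
        w = w - H \ g;
    end
    wMAP = w;
    y  = sigm(Phi * wMAP);
    SN = inv(Phi' * diag(y .* (1 - y)) * Phi + alpha * eye(M)); % covariance of the Gaussian approximation

The approximate posterior N(w | wMAP, SN) can then be used, for example by sampling weight vectors, to obtain predictive class probabilities that account for the uncertainty in w.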
Example

Barber: DemosExercises\demoBayesLogRegression.m
Naive Bayes classifier

Why "naive"?
    strong independence assumptions
    assumes that the presence/absence of a feature of a class is unrelated to the presence/absence of any other feature, given the class variable
    ignores relations between features and assumes that all features contribute independently to a class (see the sketch below)
[http://en.wikipedia.org/wiki/Naive_Bayes_classifier]
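A minimal Gaussian naive Bayes sketch in Matlab with made-up two-class data; modelling each feature with a per-class Gaussian is one common choice, assumed here for illustration:

    % Gaussian naive Bayes: p(x | C_k) = prod_d N(x_d | mu_kd, var_kd), features treated independently
    rng(0);
    N = 100;
    X = [randn(N/2, 2) - 1; randn(N/2, 2) + 1];    % two made-up classes in 2-D
    t = [zeros(N/2, 1); ones(N/2, 1)];
    classes = [0 1];

    for k = 1:numel(classes)
        Xk = X(t == classes(k), :);
        prior(k) = size(Xk, 1) / N;                % class prior p(C_k)
        mu(k, :) = mean(Xk, 1);                    % per-feature means
        v(k, :)  = var(Xk, 0, 1);                  % per-feature variances
    end

    xnew = [0.3 -0.2];                             % a new point to classify (made-up)
    for k = 1:numel(classes)
        loglik   = sum(-0.5*log(2*pi*v(k, :)) - (xnew - mu(k, :)).^2 ./ (2*v(k, :)));
        score(k) = log(prior(k)) + loglik;         % log p(C_k) + sum_d log p(x_d | C_k)
    end
    [~, khat] = max(score);
    predicted = classes(khat)                      % class with the largest posterior score

Because the features are modelled independently given the class, training reduces to estimating a handful of per-class, per-feature statistics.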
Thank you for your attention 
