### Deep Boltzmann Machines

```
Deep Boltzmann machines
Paper by: R. Salakhutdinov, G. Hinton
Outline
- Problems with some other methods!
- Energy-based models
- Boltzmann machine
- Restricted Boltzmann machine
- Deep Boltzmann machine
Problems with other methods!
- Supervised learning needs labeled data.
- The amount of information that can be learned is restricted by the labels!
- Abnormalities sometimes have to be recognized before they have ever been seen, such as certain conditions in a nuclear power plant.
- So instead of learning p(label | data), learn p(data).
Energy Based Models
- An energy function E(x) is defined.
- The energy function assigns a score (a scalar value) to each configuration x.
- Ex.: p(x) = e^{-E(x)} / Z, the Boltzmann (Gibbs) distribution.
- Normalization factor (partition function): Z = \sum_x e^{-E(x)}, the sum (or integral) of the numerator over all configurations.
- Parameters that give lower energy to the observed data are desired.
Boltzmann machine
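- Markov random field (MRF) with hidden variables.
- Undirected edges represent dependencies; each edge can carry a weight.
- Conditional distributions over hidden and visible units (\sigma is the logistic function; W, L, J are the visible-hidden, visible-visible and hidden-hidden weights):
    p(h_j = 1 | v, h_{-j}) = \sigma(\sum_i W_{ij} v_i + \sum_{m \neq j} J_{jm} h_m)
    p(v_i = 1 | h, v_{-i}) = \sigma(\sum_j W_{ij} h_j + \sum_{k \neq i} L_{ik} v_k)
Learning process
- Parameter updates (\alpha is the learning rate):
    \Delta W = \alpha (E_{data}[v h^T] - E_{model}[v h^T])
    \Delta L = \alpha (E_{data}[v v^T] - E_{model}[v v^T])
    \Delta J = \alpha (E_{data}[h h^T] - E_{model}[h h^T])
- Exact maximum likelihood learning is intractable.
- Use Gibbs sampling to approximate the two expectations.
- Run two separate Markov chains: one for the data-dependent expectation and one for the model expectation.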
Restricted Boltzmann Machine
- Setting J = 0 and L = 0, i.e. no hidden-hidden and no visible-visible connections!
- Learning can be carried out efficiently using Contrastive Divergence (CD),
- or a stochastic approximation procedure (SAP).
- Variational approach to estimating the data-dependent expectations.
Stochastic approximation procedure (SAP)
- \theta_t and X_t: current parameters and state.
- \theta_t and X_t are updated sequentially as follows:
- Given X_t, a new state X_{t+1} is sampled from a transition operator T_{\theta_t}(X_{t+1}; X_t) that leaves p_{\theta_t} invariant.
- The new parameter \theta_{t+1} is obtained by replacing the intractable model expectation with the expectation with respect to X_{t+1}.
- The learning rate has to decrease with time, for example \alpha_t = 1/t.
Why go deep?
- Deep architectures are representationally efficient: fewer computational units are needed for the same function.
- They allow a hierarchy of features to be represented.
- Non-local generalization.
- It is easier to monitor what is being learned and to guide the machine.
Deep Boltzmann Machine
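- Undirected connections between all adjacent layers.
- Conditional distributions over visible and hidden units, for a DBM with two hidden layers h^1, h^2 and weights W^1, W^2 (\sigma is the logistic function):
    p(h^1_j = 1 | v, h^2) = \sigma(\sum_i W^1_{ij} v_i + \sum_m W^2_{jm} h^2_m)
    p(h^2_m = 1 | h^1) = \sigma(\sum_j W^2_{jm} h^1_j)
    p(v_i = 1 | h^1) = \sigma(\sum_j W^1_{ij} h^1_j)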
Pretraining (greedy layerwise)
MNIST dataset
NORB
Misclassification error rate:
DBM: 10.8%, SVM: 11.6%, logistic regression: 22.5%, K-nearest neighbors: 18.4%
Thank you!
```
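The Boltzmann (Gibbs) distribution from the energy-based models slide can be checked numerically on a model small enough to enumerate. The sketch below is only an illustration: the quadratic energy form, the function names `energy` and `boltzmann_distribution`, and the weights `W` and biases `b` are assumptions made for this example and are not taken from the slides.

```python
import itertools
import math


def energy(x, W, b):
    """E(x) = -sum_{i<j} W[i][j]*x_i*x_j - sum_i b[i]*x_i for a binary vector x."""
    n = len(x)
    pair = sum(W[i][j] * x[i] * x[j] for i in range(n) for j in range(i + 1, n))
    bias = sum(b[i] * x[i] for i in range(n))
    return -pair - bias


def boltzmann_distribution(W, b):
    """Enumerate all 2^n binary states and return {state: p(state)}."""
    states = list(itertools.product([0, 1], repeat=len(b)))
    unnormalized = [math.exp(-energy(x, W, b)) for x in states]
    Z = sum(unnormalized)  # partition function: sum of the numerator over all states
    return {x: u / Z for x, u in zip(states, unnormalized)}


# Hypothetical 3-unit model; the numbers are arbitrary illustration values.
W = [[0.0, 1.0, -0.5],
     [0.0, 0.0, 0.8],
     [0.0, 0.0, 0.0]]
b = [0.1, -0.2, 0.0]

p = boltzmann_distribution(W, b)
most_likely = max(p, key=p.get)  # the lowest-energy state has the highest probability
```

Enumeration costs 2^n terms, which is why the partition function becomes intractable for realistic models and sampling is used instead.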
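As a rough illustration of Contrastive Divergence for a restricted Boltzmann machine, here is a minimal CD-1 sketch in plain Python. It assumes binary units and adds visible biases `b` and hidden biases `c`, which the slides do not mention; the function names, the toy data, and the learning rate are likewise hypothetical choices for this example.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sample_hidden(v, W, c, rng):
    """p(h_j = 1 | v) = sigmoid(c_j + sum_i v_i * W[i][j]); returns (probs, sample)."""
    probs = [sigmoid(c[j] + sum(v[i] * W[i][j] for i in range(len(v))))
             for j in range(len(c))]
    return probs, [1 if rng.random() < q else 0 for q in probs]


def sample_visible(h, W, b, rng):
    """p(v_i = 1 | h) = sigmoid(b_i + sum_j W[i][j] * h_j); returns (probs, sample)."""
    probs = [sigmoid(b[i] + sum(W[i][j] * h[j] for j in range(len(h))))
             for i in range(len(b))]
    return probs, [1 if rng.random() < q else 0 for q in probs]


def cd1_update(v0, W, b, c, lr, rng):
    """One CD-1 step for a single training vector v0; updates W, b, c in place."""
    ph0, h0 = sample_hidden(v0, W, c, rng)   # positive (data-dependent) phase
    _, v1 = sample_visible(h0, W, b, rng)    # one Gibbs step away from the data
    ph1, _ = sample_hidden(v1, W, c, rng)    # negative phase statistics
    for i in range(len(b)):
        for j in range(len(c)):
            W[i][j] += lr * (v0[i] * ph0[j] - v1[i] * ph1[j])
        b[i] += lr * (v0[i] - v1[i])
    for j in range(len(c)):
        c[j] += lr * (ph0[j] - ph1[j])
    return v1


# Toy usage: 4 visible units, 2 hidden units, two made-up binary patterns.
rng = random.Random(0)
W = [[0.0, 0.0] for _ in range(4)]
b = [0.0] * 4
c = [0.0] * 2
data = [[1, 1, 0, 0], [0, 0, 1, 1]]
for _ in range(50):
    for v in data:
        reconstruction = cd1_update(v, W, b, c, lr=0.1, rng=rng)
```

Because there are no hidden-hidden or visible-visible connections, each layer can be sampled in one pass given the other, which is what makes the two `sample_*` functions this simple.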
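The distinguishing feature of a deep Boltzmann machine is that a middle hidden layer receives input from both the layer below and the layer above. The sketch below illustrates this for a model with two hidden layers and no bias terms. The layer sizes, the weight layout (`W1[i][j]` connects visible unit i to first-layer unit j, `W2[j][m]` connects first-layer unit j to second-layer unit m), and all function names are assumptions made for the example.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def p_h1_given(v, h2, W1, W2):
    """p(h1_j = 1 | v, h2): bottom-up input from v plus top-down input from h2."""
    return [sigmoid(sum(W1[i][j] * v[i] for i in range(len(v))) +
                    sum(W2[j][m] * h2[m] for m in range(len(h2))))
            for j in range(len(W2))]


def p_h2_given(h1, W2):
    """p(h2_m = 1 | h1): the top layer only sees the first hidden layer."""
    return [sigmoid(sum(W2[j][m] * h1[j] for j in range(len(h1))))
            for m in range(len(W2[0]))]


def p_v_given(h1, W1):
    """p(v_i = 1 | h1): the visible layer only sees the first hidden layer."""
    return [sigmoid(sum(W1[i][j] * h1[j] for j in range(len(h1))))
            for i in range(len(W1))]


def bernoulli(probs, rng):
    return [1 if rng.random() < q else 0 for q in probs]


def gibbs_sweep(v, h1, h2, W1, W2, rng):
    """One sweep: resample h1 given (v, h2), then v and h2 given the new h1."""
    h1 = bernoulli(p_h1_given(v, h2, W1, W2), rng)
    v = bernoulli(p_v_given(h1, W1), rng)
    h2 = bernoulli(p_h2_given(h1, W2), rng)
    return v, h1, h2


# Toy usage with arbitrary weights: 3 visible, 2 first-layer, 2 second-layer units.
rng = random.Random(1)
W1 = [[0.5, -0.3], [0.2, 0.4], [-0.6, 0.1]]
W2 = [[0.7, -0.2], [0.3, 0.5]]
v, h1, h2 = gibbs_sweep([1, 0, 1], [0, 0], [1, 0], W1, W2, rng)
```

Given `h1`, the layers `v` and `h2` are conditionally independent, so they can be resampled in either order within a sweep.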