### Learning Multiplicative Interactions

*Many slides from Geoffrey Hinton.*

#### Two different meanings of "multiplicative"
- If we take two density models and multiply together their probability distributions at each point in data-space, we get a "product of experts".
  - The product of two Gaussian experts is a Gaussian.
- If we take two variables and multiply them together to provide input to a third variable, we get a "multiplicative interaction".
  - The distribution of the product of two Gaussian-distributed variables is NOT Gaussian; it is a heavy-tailed distribution. One Gaussian determines the standard deviation of the other.
  - Heavy-tailed distributions are the signatures of multiplicative interactions between latent variables.
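The contrast between the two meanings is easy to check numerically: a product of Gaussian *densities* is again Gaussian, but the product of two Gaussian-distributed *variables* has excess kurtosis of about 6. A minimal NumPy sketch (variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
a = rng.standard_normal(n)   # one Gaussian variable
b = rng.standard_normal(n)   # another Gaussian variable
prod = a * b                 # multiplicative interaction of the two

def excess_kurtosis(x):
    """Sample excess kurtosis: ~0 for a Gaussian, >0 for heavy tails."""
    z = (x - x.mean()) / x.std()
    return float((z ** 4).mean() - 3.0)

print(excess_kurtosis(a))     # close to 0: still Gaussian
print(excess_kurtosis(prod))  # close to 6: heavy-tailed
```

The theoretical excess kurtosis of the product of two independent standard normals is exactly 6, since E[(ab)^4] = E[a^4]E[b^4] = 9.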
#### Learning multiplicative interactions

- It is fairly easy to learn multiplicative interactions if all of the variables are observed.
  - This is possible if we control the variables used to create a training set (e.g. pose, lighting, identity, ...).
- It is also easy to learn energy-based models in which all but one of the terms in each multiplicative interaction are observed.
  - Inference is still easy.
- If more than one of the terms in each multiplicative interaction is unobserved, the interactions between hidden variables make inference difficult.
  - Alternating Gibbs sampling can be used if the latent variables form a bipartite graph.
#### Higher-order Boltzmann machines (Sejnowski, ~1986)

- The usual energy function is quadratic in the states:

$$-E = \sum_i s_i b_i + \sum_{i<j} s_i s_j w_{ij}$$

- But we could use higher-order interactions:

$$-E = \sum_i s_i b_i + \sum_{i,j,h} s_i s_j s_h w_{ijh}$$

- Hidden unit $h$ acts as a switch. When $h$ is on, it switches in the pairwise interaction between unit $i$ and unit $j$.
  - Units $i$ and $j$ can also be viewed as switches that control the pairwise interactions between $j$ and $h$ or between $i$ and $h$.
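The switching interpretation can be verified directly: with the hidden states fixed, the three-way energy reduces to a quadratic energy whose pairwise weight matrix is gated by $\mathbf{h}$. A small NumPy sketch (sizes and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
nv, nh = 4, 3
w = rng.standard_normal((nv, nv, nh))     # three-way weights w_ijh
bias = rng.standard_normal(nv)

s = rng.integers(0, 2, nv).astype(float)  # visible binary states
h = rng.integers(0, 2, nh).astype(float)  # hidden binary states

# -E = sum_i s_i b_i + sum_{i,j,h} s_i s_j s_h w_ijh
neg_E = s @ bias + np.einsum('i,j,k,ijk->', s, s, h, w)

# With h fixed, each active hidden unit "switches in" a pairwise
# weight matrix: w_eff_ij = sum_k h_k w_ijk
w_eff = np.einsum('k,ijk->ij', h, w)
neg_E_gated = s @ bias + s @ w_eff @ s
assert np.isclose(neg_E, neg_E_gated)
```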
#### Using higher-order Boltzmann machines to model image transformations (Memisevic and Hinton, 2007)

- A global transformation specifies which pixel goes to which other pixel.
- Conversely, each pair of similar-intensity pixels, one in each image, votes for a particular global transformation.

*(Figure: hidden transformation units connecting image(t) to image(t+1).)*
#### Using higher-order Boltzmann machines to model image transformations

- For binary images, a simple energy function that captures all possible correlations between the components of $\mathbf{x}$, $\mathbf{y}$, $\mathbf{h}$ is

$$E(\mathbf{y},\mathbf{h};\mathbf{x}) = -\sum_{ijk} w_{ijk}\, x_i y_j h_k \qquad (1)$$

- Using this energy function, we can now define the joint distribution $p(\mathbf{y},\mathbf{h}\,|\,\mathbf{x})$ over outputs and hidden variables by exponentiating and normalizing:

$$p(\mathbf{y},\mathbf{h}\,|\,\mathbf{x}) = \frac{1}{Z(\mathbf{x})}\exp\big(-E(\mathbf{y},\mathbf{h};\mathbf{x})\big) \qquad (2)$$

- From Eqs. 1 and 2, we get

$$p(h_k = 1\,|\,\mathbf{x},\mathbf{y}) = \sigma\Big(\sum_{ij} w_{ijk}\, x_i y_j\Big)$$

$$p(y_j = 1\,|\,\mathbf{x},\mathbf{h}) = \sigma\Big(\sum_{ik} w_{ijk}\, x_i h_k\Big)$$
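Conditionals of this form are cheap to evaluate: each one is a logistic of a bilinear form in the other two groups. A sketch assuming a dense weight tensor `w[i, j, k]` (all names and sizes are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)
nx, ny, nh = 5, 5, 4
w = 0.1 * rng.standard_normal((nx, ny, nh))  # w_ijk
x = rng.integers(0, 2, nx).astype(float)     # conditioning image(t)
y = rng.integers(0, 2, ny).astype(float)     # image(t+1)
h = rng.integers(0, 2, nh).astype(float)     # transformation units

# p(h_k = 1 | x, y) = sigma(sum_ij w_ijk x_i y_j)
p_h = sigmoid(np.einsum('i,j,ijk->k', x, y, w))

# p(y_j = 1 | x, h) = sigma(sum_ik w_ijk x_i h_k)
p_y = sigmoid(np.einsum('i,k,ijk->j', x, h, w))

assert p_h.shape == (nh,) and p_y.shape == (ny,)
```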
#### Making the reconstruction easier

- Condition on the first image so that only one visible group needs to be reconstructed.
  - Given the hidden states and the previous image, the pixels in the second image are conditionally independent.

*(Figure: hidden transformation units connecting image(t) to image(t+1).)*
#### The main problem with 3-way interactions

- Energy function:

$$-E = \sum_i s_i b_i + \sum_{i,j,h} s_i s_j s_h w_{ijh}$$

- There are far too many three-way weights.
- We can reduce the number in several straightforward ways:
  - Do dimensionality reduction on each group before the three-way interactions.
  - Use spatial locality to limit the range of the three-way interactions.
- A much more interesting approach (which can be combined with the other two) is to factor the interactions so that they can be specified with fewer parameters.
  - This leads to a novel type of learning module.
#### Factoring three-way interactions

- We use factors that correspond to 3-way outer products:

$$w_{ijh} = \sum_f w_{if}\, w_{jf}\, w_{hf}$$

- Unfactored:

$$E = -\sum_{i,j,h} s_i s_j s_h\, w_{ijh}$$

- Factored:

$$E = -\sum_f \sum_{i,j,h} s_i s_j s_h\, w_{if}\, w_{jf}\, w_{hf}$$
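The saving is easy to quantify: an unfactored tensor over $n$ units per group needs $n^3$ weights, while $F$ factors need only $3nF$. The identity between the two energy forms can be checked numerically (a sketch; sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
n, nf = 6, 4                      # units per group, number of factors
wi = rng.standard_normal((n, nf))
wj = rng.standard_normal((n, nf))
wh = rng.standard_normal((n, nf))

# factored tensor: w_ijh = sum_f w_if w_jf w_hf
w = np.einsum('if,jf,hf->ijh', wi, wj, wh)
assert w.size == n ** 3                            # unfactored count
assert wi.size + wj.size + wh.size == 3 * n * nf   # factored count

si = rng.integers(0, 2, n).astype(float)
sj = rng.integers(0, 2, n).astype(float)
sh = rng.integers(0, 2, n).astype(float)

E_unfactored = -np.einsum('i,j,h,ijh->', si, sj, sh, w)
# factored form: -sum_f (si . wi_f)(sj . wj_f)(sh . wh_f)
E_factored = -np.sum((si @ wi) * (sj @ wj) * (sh @ wh))
assert np.isclose(E_unfactored, E_factored)
```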
#### Factored 3-way restricted Boltzmann machines for modeling natural images (Ranzato, Krizhevsky and Hinton, 2010)

- A joint 3-way model.
- Models the covariance structure of natural images.
- The visible units are two identical copies of the image.
#### A powerful module for deep learning

- Define an energy function in terms of 3-way multiplicative interactions between two visible binary units, $v_i$, $v_j$, and one hidden binary unit $h_k$:

$$E(\mathbf{v},\mathbf{h}) = -\sum_{i,j,k} v_i v_j h_k\, w_{ijk}$$

- Model the three-way weights as a sum of "factors", $f$, each of which is a three-way outer product:

$$w_{ijk} = \sum_f B_{if}\, C_{jf}\, P_{fk}$$

- Since the factors are connected twice to the same image, through matrices $B$ and $C$, it is natural to tie their weights, further reducing the number of parameters:

$$w_{ijk} = \sum_f C_{if}\, C_{jf}\, P_{fk}$$
#### A powerful module for deep learning

- So the energy function becomes:

$$E(\mathbf{v},\mathbf{h}) = -\sum_f \Big(\sum_i C_{if}\, v_i\Big)^2 \Big(\sum_k P_{fk}\, h_k\Big)$$

- The parameters of the model can be learned by maximizing the log likelihood, whose gradient is given by:

$$\frac{\partial L}{\partial \theta} = \Big\langle \frac{\partial E}{\partial \theta}\Big\rangle_{\text{model}} - \Big\langle \frac{\partial E}{\partial \theta}\Big\rangle_{\text{data}}$$

- The hidden units are conditionally independent given the states of the visible units, and their binary states can be sampled using:

$$p(h_k = 1\,|\,\mathbf{v}) = \sigma\Big(\sum_f P_{fk}\Big(\sum_i C_{if}\, v_i\Big)^2 + b_k\Big)$$

- However, given the hidden states, the visible units are no longer independent.
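A sketch of this factored energy and the hidden conditional, with `C` as the tied visible-to-factor matrix and `P` the factor-to-hidden matrix (all names and sizes are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(4)
nv, nf, nh = 8, 5, 3
C = 0.1 * rng.standard_normal((nv, nf))   # tied visible-factor weights
P = 0.1 * rng.standard_normal((nf, nh))   # factor-hidden weights
b = rng.standard_normal(nh)               # hidden biases

v = rng.standard_normal(nv)
h = rng.integers(0, 2, nh).astype(float)

filt = v @ C                  # sum_i C_if v_i, one value per factor
E = -(filt ** 2) @ P @ h      # -sum_f (sum_i C_if v_i)^2 sum_k P_fk h_k

# the hidden units are conditionally independent given v:
p_h = sigmoid((filt ** 2) @ P + b)   # p(h_k = 1 | v)
assert p_h.shape == (nh,)
```

Note how the squared factor outputs `filt ** 2` are what make the module sensitive to pairwise (covariance) structure in `v` rather than to individual pixel values.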
#### Producing reconstructions using hybrid Monte Carlo

- Integrate out the hidden units and run the hybrid Monte Carlo algorithm (HMC) on the free energy:

$$F(\mathbf{v}) = -\sum_k \log\Big(1 + \exp\Big(\sum_f P_{fk}\Big(\sum_i C_{if}\, v_i\Big)^2 + b_k\Big)\Big)$$
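HMC needs both the free energy and its gradient with respect to the visible vector. Both follow from the expression above; a sketch (the analytic gradient is sanity-checked with finite differences, and all sizes are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def free_energy(v, C, P, b):
    # F(v) = -sum_k log(1 + exp(sum_f P_fk (C_f . v)^2 + b_k))
    act = ((v @ C) ** 2) @ P + b
    return -np.sum(np.logaddexp(0.0, act))   # stable softplus

def free_energy_grad(v, C, P, b):
    # dF/dv_i = -2 sum_f C_if filt_f sum_k P_fk sigma(act_k)
    filt = v @ C
    act = (filt ** 2) @ P + b
    return -2.0 * C @ (filt * (P @ sigmoid(act)))

rng = np.random.default_rng(5)
nv, nf, nh = 6, 4, 3
C = 0.1 * rng.standard_normal((nv, nf))
P = 0.1 * rng.standard_normal((nf, nh))
b = rng.standard_normal(nh)
v = rng.standard_normal(nv)

# finite-difference check of the gradient
eps = 1e-5
g = free_energy_grad(v, C, P, b)
for i in range(nv):
    e = np.zeros(nv); e[i] = eps
    num = (free_energy(v + e, C, P, b) - free_energy(v - e, C, P, b)) / (2 * eps)
    assert abs(num - g[i]) < 1e-6
```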
#### Modeling the joint density of two images under a variety of transformations (Hinton et al., 2011)

- Describes a generative model of the relationship between two images.
- The model is defined as a factored three-way Boltzmann machine, in which hidden variables collaborate to define the joint correlation matrix for image pairs.
#### Model

- Given two real-valued images $\mathbf{x}$ and $\mathbf{y}$, define the matching score of a triplet $(\mathbf{x}, \mathbf{y}, \mathbf{h})$:

$$S(\mathbf{x},\mathbf{y},\mathbf{h}) = \sum_f \Big(\sum_{i=1}^{N_x} x_i B_{if}\Big)\Big(\sum_{j=1}^{N_y} y_j C_{jf}\Big)\Big(\sum_{k=1}^{N_h} h_k P_{fk}\Big)$$

- Add bias terms to the matching score to get the energy function:

$$E(\mathbf{x},\mathbf{y},\mathbf{h}) = -S(\mathbf{x},\mathbf{y},\mathbf{h}) - \sum_{k=1}^{N_h} b_k h_k + \frac{1}{2}\sum_{i=1}^{N_x}(x_i - a_i)^2 + \frac{1}{2}\sum_{j=1}^{N_y}(y_j - c_j)^2 \qquad (1)$$

- Exponentiate and normalize the energy function:

$$p(\mathbf{x},\mathbf{y},\mathbf{h}) = \frac{1}{Z}\exp\big(-E(\mathbf{x},\mathbf{y},\mathbf{h})\big) \qquad (2)$$
#### Model

- Marginalize over $\mathbf{h}$ to get a distribution over an image pair $(\mathbf{x}, \mathbf{y})$:

$$p(\mathbf{x},\mathbf{y}) = \sum_{\mathbf{h}\in\{0,1\}^{N_h}} p(\mathbf{x},\mathbf{y},\mathbf{h})$$

- And then we can get

$$p(h_k\,|\,\mathbf{x},\mathbf{y}) = \mathrm{bernoulli}\Big(\sigma\Big(b_k + \sum_f P_{fk}\Big(\sum_i x_i B_{if}\Big)\Big(\sum_j y_j C_{jf}\Big)\Big)\Big) \qquad (3)$$

$$p(x_i\,|\,\mathbf{y},\mathbf{h}) = \mathcal{N}\Big(a_i + \sum_f B_{if}\Big(\sum_j y_j C_{jf}\Big)\Big(\sum_k h_k P_{fk}\Big);\ 1.0\Big) \qquad (4)$$

$$p(y_j\,|\,\mathbf{x},\mathbf{h}) = \mathcal{N}\Big(c_j + \sum_f C_{jf}\Big(\sum_i x_i B_{if}\Big)\Big(\sum_k h_k P_{fk}\Big);\ 1.0\Big) \qquad (5)$$

- This shows that, among the three sets of variables, the conditional distribution of any one set given the other two is easy to compute.
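Eqs. 3-5 make block Gibbs sampling straightforward: each group of variables is sampled given the other two. A sketch with matrices `B`, `C`, `P` and biases `a`, `c`, `b` as in Eq. 1 (all sizes are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(6)
nx, ny, nf, nh = 6, 6, 4, 3
B = 0.1 * rng.standard_normal((nx, nf))
C = 0.1 * rng.standard_normal((ny, nf))
P = 0.1 * rng.standard_normal((nf, nh))
a = rng.standard_normal(nx)   # x biases
c = rng.standard_normal(ny)   # y biases
b = rng.standard_normal(nh)   # hidden biases

x = rng.standard_normal(nx)
y = rng.standard_normal(ny)

fx, fy = x @ B, y @ C         # factor activations of each image

# Eq. 3: h is Bernoulli given (x, y)
p_h = sigmoid(b + (fx * fy) @ P)
h = (rng.random(nh) < p_h).astype(float)

# Eq. 4: x is Gaussian with unit variance given (y, h)
mean_x = a + B @ (fy * (P @ h))
x_new = mean_x + rng.standard_normal(nx)

# Eq. 5: y is Gaussian with unit variance given (x, h)
mean_y = c + C @ (fx * (P @ h))
y_new = mean_y + rng.standard_normal(ny)

assert p_h.shape == (nh,) and mean_x.shape == (nx,) and mean_y.shape == (ny,)
```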

#### Three-way contrastive divergence

Input: training image pairs $\{(\mathbf{x}^{(t)}, \mathbf{y}^{(t)})\}_{t=1}^{T}$, learning rate $\eta$.

```
repeat
    for t from 1 to T do
        # positive phase: sample hidden states from the data (Eq. 3)
        h     ~ p(h | x, y)
        # negative phase: reconstruct y and resample h (Eqs. 5 and 3)
        y_rec ~ p(y | x, h)
        h_rec ~ p(h | x, y_rec)
        # update with the difference of positive and negative
        # statistics of the matching score S
        B <- B + eta * (dS/dB(x, y, h) - dS/dB(x, y_rec, h_rec))
        C <- C + eta * (dS/dC(x, y, h) - dS/dC(x, y_rec, h_rec))
        P <- P + eta * (dS/dP(x, y, h) - dS/dP(x, y_rec, h_rec))
        # update the biases a, c, b analogously
    end for
until the convergence criterion is met
```
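The three-way CD procedure above can be made concrete. The sketch below runs one pass of CD-1 on random image pairs, using Eqs. 3 and 5 for the positive and negative phases and the gradient of the matching score $S$ for the updates (the sizes, the omission of biases, and the use of probabilities for the negative hidden statistics are illustrative choices, not the authors' exact recipe):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(7)
nx, ny, nf, nh, eta = 4, 4, 3, 2, 0.01
B = 0.01 * rng.standard_normal((nx, nf))
C = 0.01 * rng.standard_normal((ny, nf))
P = 0.01 * rng.standard_normal((nf, nh))

def grads(x, y, h):
    """Gradients of the matching score S w.r.t. B, C, P."""
    fx, fy, gh = x @ B, y @ C, P @ h
    dB = np.outer(x, fy * gh)       # dS/dB_if = x_i (y.C)_f (P h)_f
    dC = np.outer(y, fx * gh)       # dS/dC_jf = y_j (x.B)_f (P h)_f
    dP = np.outer(fx * fy, h)       # dS/dP_fk = (x.B)_f (y.C)_f h_k
    return dB, dC, dP

pairs = [(rng.standard_normal(nx), rng.standard_normal(ny))
         for _ in range(10)]
for x, y in pairs:
    # positive phase (Eq. 3)
    h = (rng.random(nh) < sigmoid((x @ B * (y @ C)) @ P)).astype(float)
    pos = grads(x, y, h)
    # negative phase: reconstruct y (Eq. 5, biases omitted), resample h
    y_rec = C @ ((x @ B) * (P @ h)) + rng.standard_normal(ny)
    h_rec = sigmoid((x @ B * (y_rec @ C)) @ P)   # use probabilities
    neg = grads(x, y_rec, h_rec)
    # ascend the CD approximation to the log-likelihood gradient
    B += eta * (pos[0] - neg[0])
    C += eta * (pos[1] - neg[1])
    P += eta * (pos[2] - neg[2])

assert np.all(np.isfinite(B)) and np.all(np.isfinite(P))
```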
Thank you