### down

```
Natural Gradient Works
Efficiently in Learning
S Amari
11.03.18.(Fri)
Computational Modeling of
Intelligence
Summarized by Joon Shik Kim
Abstract
• The ordinary gradient of a function does
not represent its steepest direction, but the
natural gradient does.
• The dynamical behavior of natural gradient
online learning is analyzed and is proved to
be Fisher efficient.
• The plateau phenomenon, which appears in
the backpropagation learning algorithm of
multilayer perceptrons, might disappear or
might not be so serious when the natural
gradient is used.
Introduction (1/2)
• The stochastic gradient method is a
popular learning method in the general
nonlinear optimization framework.
• The parameter space is not Euclidean but
has a Riemannian metric structure in
many cases.
• In these cases, the ordinary gradient
does not give the steepest direction of
the target function.
Introduction (2/2)
• The paper generalizes the idea of
Barkai, Seung, and Sompolinsky (1995)
and evaluates its performance based on
the Riemannian metric of errors.
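• The squared length of a small
incremental vector dw in an orthonormal
coordinate system is
    |dw|^2 = \sum_i (dw_i)^2
• When the coordinate system is
nonorthogonal, the squared length is
    |dw|^2 = \sum_{i,j} g_{ij}(w) dw_i dw_j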
• The steepest descent direction of a
function L(w) at w is defined by the
vector dw that minimizes L(w+dw)
where |dw| has a fixed length, that is,
under the constraint
    |dw|^2 = \epsilon^2
for a sufficiently small constant \epsilon.
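• The steepest descent direction of L(w) in
a Riemannian space is given by
    -\tilde{\nabla} L(w) = -G^{-1}(w) \nabla L(w)
where G = (g_{ij}) is the Riemannian metric and
\nabla L(w) is the ordinary gradient.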
• Risk function or average loss:
    L(w) = E[l(z, w)]
• Learning is a procedure to search for the
optimal w* that minimizes L(w).
Statistical Estimation of Probability
Density Function (1/2)
• In the case of statistical estimation, we
assume a statistical model {p(z,w)}, and
the problem is to obtain the probability
distribution p(z, \hat{w}) that approximates
the unknown density function q(z) in the
best way.
• The loss function is
    l(z, w) = -\log p(z, w)
Statistical Estimation of Probability
Density Function (2/2)
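• The expected loss is then given by
    L(w) = -E[\log p(z, w)] = E_q[\log \frac{q(z)}{p(z, w)}] + H_Z
where H_Z is the entropy of q(z), which does
not depend on w.
• The Riemannian metric is the Fisher information
    g_{ij}(w) = E[\frac{\partial \log p(z, w)}{\partial w_i} \frac{\partial \log p(z, w)}{\partial w_j}]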
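Fisher Information as the Metric
of Kullback-Leibler Divergence
(1/2)
• p = q(\theta + h)
    D(q || p) = \int q \ln \frac{q}{p} dz
              = -\int q \ln \frac{p}{q} dz
              \approx -\int q [ (\frac{p}{q} - 1) - \frac{1}{2} (\frac{p}{q} - 1)^2 ] dz
              = \frac{1}{2} \int \frac{1}{q} (p - q)^2 dz
    \lim_{h \to 0} \frac{D(q(\theta) || q(\theta + h))}{h^2}
              = \lim_{h \to 0} \frac{1}{h^2} \frac{1}{2} \int \frac{1}{q} (p - q)(p - q) dz
              = \lim_{h \to 0} \frac{1}{2} \int \frac{1}{q} \frac{q(\theta + h) - q(\theta)}{h} \frac{q(\theta + h) - q(\theta)}{h} dz
Fisher Information as the Metric
of Kullback-Leibler Divergence
(2/2)
    \lim_{h \to 0} \frac{D(q(\theta) || q(\theta + h))}{h^2}
              = \lim_{h \to 0} \frac{1}{2} \int \frac{1}{q} \frac{q(\theta + h) - q(\theta)}{h} \frac{q(\theta + h) - q(\theta)}{h} dz
              = \frac{1}{2} \int \frac{1}{q} \frac{\partial q}{\partial \theta} \frac{\partial q}{\partial \theta} dz
              = \frac{1}{2} \int q (\frac{1}{q} \frac{\partial q}{\partial \theta}) (\frac{1}{q} \frac{\partial q}{\partial \theta}) dz
              = \frac{1}{2} \int q \frac{\partial \ln q}{\partial \theta} \frac{\partial \ln q}{\partial \theta} dz
              = \frac{1}{2} I
I: Fisher information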
Multilayer Neural Network (1/2)
Multilayer Neural Network (2/2)
c is a normalizing constant
Natural Gradient Gives Fisher-Efficient Online Learning
Algorithms (1/4)
• D_T = {(x_1,y_1),…,(x_T,y_T)} is a set of T independent
input-output examples generated by the
teacher network having parameter w*.
• Minimizing the log loss over the
training data D_T is to obtain \hat{w}_T that
minimizes the training error
    L_train(w) = -\frac{1}{T} \sum_{t=1}^{T} \log p(x_t, y_t; w)
Natural Gradient Gives Fisher-Efficient Online Learning
Algorithms (2/4)
• The Cramér-Rao theorem states that the
expected squared error of an unbiased
estimator satisfies
    E[(\hat{w}_T - w*)(\hat{w}_T - w*)^\top] \geq \frac{1}{T} I^{-1}
where I is the Fisher information matrix.
• An estimator is said to be efficient or
Fisher efficient when it attains this bound
with equality (asymptotically, for large T).
Natural Gradient Gives Fisher-Efficient Online Learning
Algorithms (3/4)
• Theorem 2. The natural gradient online
estimator is Fisher efficient.
• Proof
Natural Gradient Gives Fisher-Efficient Online Learning
Algorithms (4/4)
```
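
The two central claims of the slides can be checked numerically on toy models. The sketch below is not from the slides or the paper: the function names (`bernoulli_kl`, `bernoulli_fisher`, `natural_gradient_mean`, `ordinary_gradient_mean`), the Bernoulli and known-variance Gaussian models, and the 1/t learning rate are all assumptions chosen for illustration. It first compares D(q(θ) || q(θ+h)) / h² with I/2 for a Bernoulli model. It then runs the online update w ← w − (1/t) G⁻¹ ∇l for a Gaussian mean, where the natural gradient step reduces to the running sample mean, which is the efficient estimator for this model.

```python
import numpy as np


def bernoulli_kl(theta, h):
    """KL divergence D(q(theta) || q(theta + h)) between two Bernoulli laws."""
    p = theta + h
    return theta * np.log(theta / p) + (1 - theta) * np.log((1 - theta) / (1 - p))


def bernoulli_fisher(theta):
    """Fisher information of the Bernoulli(theta) model."""
    return 1.0 / (theta * (1.0 - theta))


def natural_gradient_mean(samples, sigma, w0=0.0):
    """Online natural gradient estimate of a Gaussian mean (sigma known).

    Assumed toy setting:
      loss    l(z, w) = -log p(z, w), so grad = -(z - w) / sigma**2
      metric  G = 1 / sigma**2 (Fisher information of the mean)
      update  w <- w - (1/t) * G^{-1} * grad
    """
    w = w0
    g_inv = sigma ** 2
    for t, z in enumerate(samples, start=1):
        grad = -(z - w) / sigma ** 2
        w = w - (1.0 / t) * g_inv * grad
    return w


def ordinary_gradient_mean(samples, sigma, w0=0.0):
    """Same update without the G^{-1} factor, for comparison."""
    w = w0
    for t, z in enumerate(samples, start=1):
        grad = -(z - w) / sigma ** 2
        w = w - (1.0 / t) * grad
    return w


if __name__ == "__main__":
    theta = 0.3
    for h in (1e-1, 1e-2, 1e-3):
        ratio = bernoulli_kl(theta, h) / h ** 2
        print(f"h={h:g}  D/h^2={ratio:.5f}  I/2={0.5 * bernoulli_fisher(theta):.5f}")

    rng = np.random.default_rng(0)
    sigma = 3.0
    samples = rng.normal(loc=2.0, scale=sigma, size=1000)
    print("sample mean      :", samples.mean())
    print("natural gradient :", natural_gradient_mean(samples, sigma))
    print("ordinary gradient:", ordinary_gradient_mean(samples, sigma))
```

In this one-parameter example G⁻¹ only rescales the step, so the comparison shows the effect of the metric on the effective learning rate rather than on the update direction. For a multi-parameter model, G⁻¹ would be a matrix inverse applied to the gradient vector.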