Projection pursuit, ICA, NCA, PLS

Dimension reduction (2)
Projection pursuit
Partial Least Squares
Blais. “The role of the environment in synaptic plasticity…..” (1998)
Liao et al. PNAS. (2003)
Barker & Rayens. J. Chemometrics 2013.
Projection pursuit
A very broad term: finding the most “interesting”
direction of projection. How the projection is done
depends on the definition of “interesting”. If it is maximal
variation, then PP leads to PCA.
In a narrower sense:
Finding non-Gaussian projections.
For most high-dimensional clouds, most low-dimensional
projections are close to Gaussian, so the important
information in the data is in the directions for which the
projected data are far from Gaussian.
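A small numerical illustration of this claim, as a sketch rather than a proof. The cloud of independent uniform coordinates is an assumption chosen for the example, and `ks_distance_to_gaussian` is a hypothetical helper name. A single coordinate is clearly non-Gaussian, while a random one-dimensional projection is close to Gaussian.

```python
import math
import numpy as np

def ks_distance_to_gaussian(y):
    """Kolmogorov-Smirnov distance between standardized y and N(0, 1)."""
    z = np.sort((y - y.mean()) / y.std())
    n = z.size
    # Standard normal CDF evaluated at each sorted point.
    phi = np.array([0.5 * (1.0 + math.erf(v / math.sqrt(2.0))) for v in z])
    above = np.arange(1, n + 1) / n - phi  # empirical CDF minus normal CDF
    below = phi - np.arange(0, n) / n      # normal CDF minus empirical CDF
    return float(max(above.max(), below.max()))

rng = np.random.default_rng(0)
n, p = 20000, 200
X = rng.uniform(-1.0, 1.0, size=(n, p))  # high-dimensional, non-Gaussian cloud

w = rng.standard_normal(p)
w /= np.linalg.norm(w)                   # random unit direction

ks_axis = ks_distance_to_gaussian(X[:, 0])  # projection onto one coordinate
ks_random = ks_distance_to_gaussian(X @ w)  # projection onto a random direction
print(f"single coordinate: {ks_axis:.3f}, random direction: {ks_random:.3f}")
```

The "interesting" directions here are the coordinate axes; a direction picked at random mixes many coordinates and looks Gaussian.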
It boils down to objective functions – each kind of
“interesting” has its own objective function(s) to maximize.
Projection pursuit with multi-modality as the objective.
One objective function to measure multi-modality:
It uses the first three moments of the distribution.
It can help finding clusters through visualization.
To find w, the function is maximized over w by gradient ascent.
PCA can be viewed as a case of PP, with the variance of the projection as the objective: $R_w = E[(w'x)^2]$.
For other PC directions, find projection onto space
orthogonal to the previously found PCs.
$R_w = E\left[\left(w'\left(I - \sum_i w_i w_i'\right)x\right)^2\right]$
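A minimal sketch of this deflation idea, assuming centered data and using power iteration to maximize the variance objective. Both the optimizer and the function names `leading_direction` and `pca_by_pursuit` are choices made for this example, not part of the slides.

```python
import numpy as np

def leading_direction(C, n_iter=1000, seed=0):
    """Power iteration for the unit vector w that maximizes w' C w."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(C.shape[0])
    w /= np.linalg.norm(w)
    for _ in range(n_iter):
        w = C @ w
        w /= np.linalg.norm(w)
    return w

def pca_by_pursuit(X, k):
    """Find k PC directions one at a time, each orthogonal to the previous ones."""
    Xc = X - X.mean(axis=0)
    p = Xc.shape[1]
    proj = np.eye(p)                # I - sum_i w_i w_i'
    directions = []
    for _ in range(k):
        Z = Xc @ proj               # each row is (I - sum_i w_i w_i') x
        C = Z.T @ Z / Z.shape[0]    # second-moment matrix of the deflated data
        w = leading_direction(C)
        directions.append(w)
        proj = proj - np.outer(w, w)
    return np.array(directions)

rng = np.random.default_rng(1)
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))  # random rotation
X = (rng.standard_normal((2000, 4)) * [5.0, 3.0, 1.0, 0.5]) @ Q.T
W = pca_by_pursuit(X, k=2)
print(W)
```

Up to sign, the rows of `W` should agree with the top two eigenvectors of the sample covariance matrix.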
Some other objective functions (y is the RV generated by
projection w’x)
The kurtosis as defined here has value 0 for a normal distribution.
Higher kurtosis: peaked and fat-tailed.
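To make the gradient-based search concrete, here is a sketch of projection pursuit with kurtosis as the objective. The two-dimensional data (one Laplace source, one Gaussian source), the step size, and the starting point are all assumptions for illustration. Because the data are approximately white and w is kept at unit length, E[y^2] is close to 1, so ascending E[y^4] also ascends the kurtosis.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
# Two independent, unit-variance sources: one heavy-tailed, one Gaussian.
s_heavy = rng.laplace(scale=1.0 / np.sqrt(2.0), size=n)
s_gauss = rng.standard_normal(n)
theta = 0.6
true_dir = np.array([np.cos(theta), np.sin(theta)])    # carries the Laplace source
other_dir = np.array([-np.sin(theta), np.cos(theta)])  # carries the Gaussian source
X = np.outer(s_heavy, true_dir) + np.outer(s_gauss, other_dir)
X -= X.mean(axis=0)

def kurtosis_objective(w, X):
    """Excess kurtosis of the projection y = X w (0 for a Gaussian)."""
    y = X @ w
    return np.mean(y**4) / np.mean(y**2) ** 2 - 3.0

w = np.array([1.0, 0.0])  # arbitrary starting direction
step = 0.1
for _ in range(500):
    y = X @ w
    grad = 4.0 * (X * (y**3)[:, None]).mean(axis=0)  # gradient of E[y^4] in w
    w = w + step * grad                               # gradient ascent step
    w /= np.linalg.norm(w)                            # return to the unit sphere

print("found direction:", w, "kurtosis:", kurtosis_objective(w, X))
```

The search should settle on the heavy-tailed direction (up to sign), which is the non-Gaussian projection in this toy data set.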
Independent component analysis
Again, another view of dimension reduction is
factorization into latent variables.
ICA finds a unique solution by requiring the factors to be
statistically independent, rather than just uncorrelated.
Lack of correlation only determines the second-degree cross-moment, while statistical independence means that for any
functions g1() and g2(), $E[g_1(y_1)\, g_2(y_2)] = E[g_1(y_1)]\, E[g_2(y_2)]$.
For a multivariate Gaussian, uncorrelatedness = independence.
A multivariate Gaussian is determined by its first and second moments alone.
Thus if the true hidden factors are Gaussian, they can still
be determined only up to a rotation.
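Outside the Gaussian case the two notions differ. The sketch below uses an assumed toy pair, y1 = u and y2 = u^2 - 1 with u standard normal, which is uncorrelated but clearly dependent. Independence would require the expectation of a product g1(y1) g2(y2) to factor into the product of expectations; with g1(y) = y^2 and g2(y) = y it does not.

```python
import numpy as np

rng = np.random.default_rng(0)
u = rng.standard_normal(100_000)
y1 = u
y2 = u**2 - 1.0  # a deterministic function of y1, so the two are dependent

# Second-degree cross-moment: the correlation is (near) zero.
corr = np.corrcoef(y1, y2)[0, 1]

# Factorization check with g1(y) = y^2 and g2(y) = y.
lhs = np.mean(y1**2 * y2)           # E[g1(y1) g2(y2)], close to 2
rhs = np.mean(y1**2) * np.mean(y2)  # E[g1(y1)] E[g2(y2)], close to 0

print(f"correlation = {corr:.3f}, E[g1 g2] = {lhs:.3f}, E[g1]E[g2] = {rhs:.3f}")
```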
In ICA, the latent variables are assumed to be independent and non-Gaussian.
The mixing matrix A must have full column rank.
ICA is a special case of PP.
The key, again, is for y to be non-Gaussian.
Several ways to measure non-Gaussianity:
(1) Kurtosis (zero for a Gaussian RV, sensitive to outliers)
(2) Entropy (Gaussian RV has the largest entropy given the
first and second moments)
$H(y) = -\int f(y) \log f(y)\, dy$
(3) Negentropy: $J(y) = H(y_{gauss}) - H(y)$
$y_{gauss}$ is a Gaussian RV with the same covariance matrix as y.
To measure statistical independence, use mutual information: $I(y_1, \dots, y_m) = \sum_{i=1}^{m} H(y_i) - H(y)$
Sum of marginal entropies minus the overall entropy.
Zero if and only if the components are independent.
The computation:
There is no closed-form solution, hence gradient descent is used.
Approximation to negentropy (for less intensive
computation and better resistance to outliers)
$J(y) \approx \left[E\{G(y)\} - E\{G(v)\}\right]^2$
Two commonly used G():
$G(y) = \frac{1}{a} \log\cosh(a y)$
$G(y) = -\exp(-a y^2 / 2)$
v is a standard Gaussian RV.
G() is some non-quadratic function. When $G(y) = y^4$ this is the kurtosis-based approximation.
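A sketch of this approximation with the log cosh choice of G. The helper name `negentropy_logcosh`, the Monte Carlo estimate of E{G(v)}, and the sample sizes are assumptions for illustration. The input is standardized first, so that y and v share the same first two moments.

```python
import numpy as np

def negentropy_logcosh(y, a=1.0, n_ref=200_000, seed=0):
    """Approximate J(y) by (E{G(y)} - E{G(v)})^2 with G(u) = log(cosh(a u)) / a."""
    y = (y - y.mean()) / y.std()  # compare at zero mean and unit variance
    # Monte Carlo stand-in for the expectation over a standard Gaussian v.
    v = np.random.default_rng(seed).standard_normal(n_ref)
    G = lambda u: np.log(np.cosh(a * u)) / a
    return float((G(y).mean() - G(v).mean()) ** 2)

rng = np.random.default_rng(1)
j_gauss = negentropy_logcosh(rng.standard_normal(50_000))
j_laplace = negentropy_logcosh(rng.laplace(size=50_000))
j_uniform = negentropy_logcosh(rng.uniform(-1.0, 1.0, size=50_000))
print(f"Gaussian: {j_gauss:.2e}, Laplace: {j_laplace:.2e}, uniform: {j_uniform:.2e}")
```

The Gaussian sample should score close to zero, while both the heavy-tailed and the flat distribution score visibly higher.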
Center the X vectors to mean zero.
Whiten the X vectors such that E(xx’)=I. This is done
through eigenvalue decomposition.
Initialize the weight vector w.
Iterate the update of w until convergence.
g() is the derivative of the non-quadratic function G().
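These steps can be sketched as a one-unit fixed-point iteration. The update used here, w ← E{x g(w'x)} − E{g'(w'x)} w followed by normalization, is the standard FastICA update with g = tanh; it is an assumption about what the missing update step looked like. The function names and the toy mixing matrix are also illustrative.

```python
import numpy as np

def whiten(X):
    """Center X and transform it so that the sample covariance is the identity."""
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / Xc.shape[0]
    evals, evecs = np.linalg.eigh(cov)  # eigenvalue decomposition
    return Xc @ evecs @ np.diag(evals ** -0.5) @ evecs.T

def fastica_one_unit(Z, n_iter=200, tol=1e-8, seed=0):
    """Estimate one independent component direction from whitened data Z."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(Z.shape[1])  # initialize the weight vector
    w /= np.linalg.norm(w)
    for _ in range(n_iter):
        y = Z @ w
        g = np.tanh(y)                   # derivative of G(y) = log cosh(y)
        g_prime = 1.0 - g**2
        w_new = (Z * g[:, None]).mean(axis=0) - g_prime.mean() * w
        w_new /= np.linalg.norm(w_new)
        converged = abs(abs(w_new @ w) - 1.0) < tol  # direction unchanged up to sign
        w = w_new
        if converged:
            break
    return w

rng = np.random.default_rng(1)
S = rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), size=(5000, 2))  # independent uniform sources
A = np.array([[1.0, 0.5], [0.3, 1.0]])                        # assumed mixing matrix
X = S @ A.T
Z = whiten(X)
w = fastica_one_unit(Z)
y = Z @ w
corr = [abs(np.corrcoef(y, S[:, j])[0, 1]) for j in range(2)]
print("abs. correlation of the recovered component with each source:", corr)
```

The recovered component should match one of the two sources up to sign and scale; further components could be found by repeating the search in the orthogonal complement.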
Figure 14.28: Mixtures of independent uniform random variables. The upper left panel
shows 500 realizations from the two independent uniform sources, the upper-right panel
their mixed versions. The lower two panels show the PCA and ICA solutions, respectively.
Network component analysis
Other than dimension reduction and hidden
factor models, there is another way to
understand a model like this:
It can be understood as explaining the
data by a bipartite network: a control
layer and an output layer.
Unlike PCA and ICA, NCA doesn’t assume
a fully linked loading matrix. Rather, the
matrix is sparse. The non-zero locations
are pre-determined by biological
knowledge about regulatory networks.
For example,
Instead of blindly searching for a
lower-dimensional space, a priori
information is incorporated into
the loading matrix.
Conditions for the solution to be unique:
(1) A is full column rank;
(2) When a column of A is removed, together with all
rows corresponding to non-zero values in the
column, the remaining matrix is still full column
rank;
(3) P must have full row rank.
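Conditions (1) and (2) involve only A, so for a given candidate matrix they can be checked mechanically. The sketch below does this with numerical ranks. The function name `satisfies_conditions_on_A` and the two small example matrices are assumptions for illustration, and condition (3) on P is not covered.

```python
import numpy as np

def satisfies_conditions_on_A(A):
    """Check conditions (1) and (2) for a candidate loading matrix A."""
    A = np.asarray(A, dtype=float)
    n_cols = A.shape[1]
    # (1) A has full column rank.
    if np.linalg.matrix_rank(A) < n_cols:
        return False
    # (2) Remove column j and every row with a non-zero entry in column j;
    #     what remains must still have full column rank (n_cols - 1).
    for j in range(n_cols):
        keep_rows = A[:, j] == 0
        reduced = np.delete(A[keep_rows], j, axis=1)
        rank = np.linalg.matrix_rank(reduced) if reduced.size else 0
        if rank < n_cols - 1:
            return False
    return True

A_good = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [0, 0, 1],
                   [1, 1, 0],
                   [0, 1, 1]])
# Here the second column touches every row, so nothing is left after removal.
A_bad = np.array([[1, 1],
                  [0, 1],
                  [1, 1]])
print(satisfies_conditions_on_A(A_good), satisfies_conditions_on_A(A_bad))
```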
Fig. 2. A completely identifiable network (a) and an unidentifiable
network (b). Although the two initial [A] matrices describing the network
matrices have an identical number of constraints (zero entries), the
network in b does not satisfy the identifiability conditions because of the
connectivity pattern of R3. The edges in red are the differences between
the two networks.
Notice that both A and P are to be estimated, so the
identifiability criteria are in fact untestable.
To compute NCA, minimize the squared loss function:
Z0 is the topology constraint matrix, i.e. which positions
of A are non-zero. It is based on prior knowledge. It is the
network connectivity matrix.
Solving NCA:
This is a linear decomposition system which has the biconvex property. It is solved by iteratively solving for A
and P while fixing the other one.
Both steps use least squares.
Convergence is judged by the total least-square error.
The total error is non-increasing in each step. Optimality
is guaranteed if the three conditions for identifiability
are satisfied. Otherwise a local optimum may be found.
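The alternating scheme can be sketched as follows. The data matrix is called `E` here (an assumed name), the function name `nca_als` and the random initialization are illustrative, and the synthetic network is made up for the example. Each A-step solves a separate least-squares problem per row, using only the positions that Z0 allows to be non-zero.

```python
import numpy as np

def nca_als(E, Z0, n_iter=100, seed=0):
    """Alternate least-squares updates of P and A, with A constrained by Z0."""
    rng = np.random.default_rng(seed)
    Z0 = np.asarray(Z0, dtype=bool)
    A = np.where(Z0, rng.standard_normal(Z0.shape), 0.0)  # random start on the allowed pattern
    errors = []
    for _ in range(n_iter):
        # P-step: least squares for P with A fixed.
        P = np.linalg.lstsq(A, E, rcond=None)[0]
        # A-step: least squares for each row of A with P fixed.
        for i in range(A.shape[0]):
            idx = np.flatnonzero(Z0[i])
            if idx.size:
                A[i, idx] = np.linalg.lstsq(P[idx].T, E[i], rcond=None)[0]
        errors.append(float(np.sum((E - A @ P) ** 2)))    # total squared error
    return A, P, errors

rng = np.random.default_rng(1)
Z0 = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1],
               [1, 1, 0], [0, 1, 1], [1, 0, 1]], dtype=bool)
A_true = np.where(Z0, rng.uniform(0.5, 2.0, Z0.shape), 0.0)
P_true = rng.standard_normal((3, 20))
E = A_true @ P_true + 0.01 * rng.standard_normal((6, 20))

A_hat, P_hat, errors = nca_als(E, Z0)
print("first and last total squared error:", errors[0], errors[-1])
```

Even when the fit is good, A and P are only recovered up to a rescaling of each column of A and the matching row of P.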
Partial least squares
Finding latent factors in X that can predict Y.
X is multi-dimensional; Y can be either a random
variable or a random vector.
The model will look like:
where Tj is a linear combination of X.
PLS is suitable for handling the p>>N situation.
$a_{k+1}$ is the (k+1)th eigenvector of
The PLS components minimize
Can be solved by iterative regression.
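One common form of the iterative-regression approach for a univariate Y is the NIPALS-style PLS1 loop sketched below. It is offered as an illustration under that assumption, not necessarily the exact variant intended here. The function name `pls1_scores` and the toy data (p >> N, with Y driven by the first column of X) are made up for the example.

```python
import numpy as np

def pls1_scores(X, y, n_components):
    """Extract PLS components for a univariate y by repeated regression and deflation."""
    Xk = X - X.mean(axis=0)
    yc = y - y.mean()
    scores = []
    for _ in range(n_components):
        w = Xk.T @ yc                  # direction of maximal covariance with y
        w /= np.linalg.norm(w)
        t = Xk @ w                     # latent score: a linear combination of X
        p_load = Xk.T @ t / (t @ t)    # regress each column of X on t
        Xk = Xk - np.outer(t, p_load)  # deflate X before the next component
        scores.append(t)
    T = np.column_stack(scores)
    coef = np.linalg.lstsq(T, yc, rcond=None)[0]  # regress y on the scores
    return T, coef

rng = np.random.default_rng(0)
N, p = 30, 200                                    # p >> N
X = rng.standard_normal((N, p))
y = 2.0 * X[:, 0] + 0.1 * rng.standard_normal(N)

T, coef = pls1_scores(X, y, n_components=3)
yc = y - y.mean()
r2 = 1.0 - np.sum((yc - T @ coef) ** 2) / np.sum(yc**2)
print(f"in-sample R^2 with 3 components: {r2:.3f}")
```

The deflation step makes the score vectors mutually orthogonal, so the final regression of y on the scores is well conditioned even though X has far more columns than rows.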
Example: PLS vs. PCA in regression:
Y is related to X1
