### Lecture 11: Face Recognition, CCA, LDA, Kernel PCA, ICA, MDS/LLE

EE462 MLCV, Lecture 13-14: Face Recognition and Subspace/Manifold Learning. Tae-Kyun Kim.
#### Face Recognition Applications

- Automatic face tagging on commercial weblogs
- Face image retrieval in MPEG-7 (our solution is part of the MPEG-7 standard)
- Automatic passport control
- Feature-length film character summarisation

A key issue is the efficient representation of face images.
#### Face Recognition vs Object Categorisation

[Figure: face image data sets vs object categorisation data sets, each illustrating intraclass and interclass variation across two classes]
In both, we seek representations/features that minimise intraclass variation and maximise interclass variation. Face image variations are more subtle than those of generic object categories. Subspace/manifold techniques (cf. Bag of Words) are the dominant approach to face image analysis.
#### Principal Component Analysis (PCA)

- Maximum variance formulation
- Minimum-error formulation
- Probabilistic PCA
#### Maximum Variance Formulation of PCA

- PCA (also known as the Karhunen-Loève transform) is a technique for dimensionality reduction, lossy data compression, feature extraction and data visualisation.
- PCA can be defined as the orthogonal projection of the data onto a lower-dimensional linear space such that the variance of the projected data is maximised.
- Given a data set $\{x_n\}$, $n = 1, \ldots, N$, with $x_n \in \mathbb{R}^D$, our goal is to project the data onto a space of dimension $M \ll D$ while maximising the variance of the projected data.
- For simplicity, take $M = 1$. The direction of this space is defined by a vector $u_1 \in \mathbb{R}^D$ s.t. $u_1^T u_1 = 1$. Each data point $x_n$ is then projected onto the scalar value $u_1^T x_n$.
The mean of the projected data is

$$ u_1^T \bar{x}, \quad \text{where} \quad \bar{x} = \frac{1}{N} \sum_{n=1}^{N} x_n. $$

The variance of the projected data is given by

$$ \frac{1}{N} \sum_{n=1}^{N} \left( u_1^T x_n - u_1^T \bar{x} \right)^2 = u_1^T S u_1, $$

where $S$ is the data covariance matrix defined as

$$ S = \frac{1}{N} \sum_{n=1}^{N} (x_n - \bar{x})(x_n - \bar{x})^T. $$
We maximise the projected variance $u_1^T S u_1$ with respect to $u_1$, subject to the normalisation condition $u_1^T u_1 = 1$. The Lagrange multiplier formulation is

$$ u_1^T S u_1 + \lambda_1 \left( 1 - u_1^T u_1 \right). $$

Setting the derivative with respect to $u_1$ to zero, we obtain

$$ S u_1 = \lambda_1 u_1, $$

so $u_1$ is an eigenvector of $S$. Left-multiplying by $u_1^T$ and using $u_1^T u_1 = 1$, the variance is

$$ u_1^T S u_1 = \lambda_1. $$
The variance is maximised when $u_1$ is the eigenvector with the largest eigenvalue $\lambda_1$. This eigenvector is called the first principal component.

For the general case of an $M$-dimensional subspace, the solution is given by the $M$ eigenvectors $u_1, u_2, \ldots, u_M$ of the data covariance matrix $S$ corresponding to the $M$ largest eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_M$, chosen to be orthonormal:

$$ u_i^T u_j = \begin{cases} 1, & i = j \\ 0, & \text{otherwise.} \end{cases} $$
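The eigen-solution above can be checked numerically. A minimal NumPy sketch, on illustrative random data (not from the lecture): the leading eigenvector of $S$ attains projected variance $\lambda_1$, and no other unit vector does better.

```python
import numpy as np

rng = np.random.default_rng(0)
# Illustrative data: N = 500 points in D = 5 dimensions with anisotropic spread.
X = rng.normal(size=(500, 5)) * np.array([3.0, 2.0, 1.0, 0.5, 0.1])

# Data covariance matrix S = (1/N) sum_n (x_n - xbar)(x_n - xbar)^T.
S = np.cov(X, rowvar=False, bias=True)

# Eigen-decomposition of S; sort eigenpairs by descending eigenvalue.
eigvals, eigvecs = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# The first principal component u_1 attains projected variance u_1^T S u_1 = lambda_1.
u1 = eigvecs[:, 0]
```

Any other unit vector $v$ gives $v^T S v \le \lambda_1$, which is exactly the maximum-variance property.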
#### Minimum-error Formulation of PCA

- An alternative (equivalent) formulation of PCA minimises the projection error. We consider a complete orthonormal set of $D$-dimensional basis vectors $\{u_i\}$, $i = 1, \ldots, D$, s.t.

$$ u_i^T u_j = \begin{cases} 1, & i = j \\ 0, & \text{otherwise.} \end{cases} $$

- Each data point can then be represented exactly by a linear combination of the basis vectors,

$$ x_n = \sum_{i=1}^{D} \alpha_{ni} u_i. $$
- The coefficients are $\alpha_{ni} = x_n^T u_i$, and without loss of generality we have

$$ x_n = \sum_{i=1}^{D} \left( x_n^T u_i \right) u_i. $$

- Our goal is to approximate each data point using $M \ll D$ dimensions. Using an $M$-dimensional linear subspace, we write each data point as

$$ \tilde{x}_n = \sum_{i=1}^{M} z_{ni} u_i + \sum_{i=M+1}^{D} b_i u_i, $$

where the $b_i$ are constants shared by all data points.
- We minimise the distortion measure

$$ J = \frac{1}{N} \sum_{n=1}^{N} \left\| x_n - \tilde{x}_n \right\|^2 $$

with respect to $u_i$, $z_{ni}$ and $b_i$.

Setting the derivative with respect to $z_{nj}$ to zero and using the orthonormality conditions, we have

$$ z_{nj} = x_n^T u_j, \quad j = 1, \ldots, M. $$

Setting the derivative of $J$ with respect to $b_j$ to zero gives

$$ b_j = \bar{x}^T u_j, \quad j = M+1, \ldots, D. $$
If we substitute for $z_{ni}$ and $b_i$, we have

$$ x_n - \tilde{x}_n = \sum_{i=M+1}^{D} \left\{ (x_n - \bar{x})^T u_i \right\} u_i. $$

The displacement vectors $x_n - \tilde{x}_n$ lie in the space orthogonal to the principal subspace, as each is a linear combination of the $u_i$ for $i = M+1, \ldots, D$.

We further get

$$ J = \frac{1}{N} \sum_{n=1}^{N} \sum_{i=M+1}^{D} \left( x_n^T u_i - \bar{x}^T u_i \right)^2 = \sum_{i=M+1}^{D} u_i^T S u_i. $$
- Consider a two-dimensional data space ($D = 2$) and a one-dimensional principal subspace ($M = 1$). We then choose $u_2$ to minimise

$$ J = u_2^T S u_2, \quad \text{subject to} \quad u_2^T u_2 = 1. $$

Setting the derivative of the corresponding Lagrangian with respect to $u_2$ to zero yields $S u_2 = \lambda_2 u_2$.

We therefore obtain the minimum value of $J$ by choosing $u_2$ as the eigenvector corresponding to the smaller eigenvalue, and hence we choose the principal subspace to be spanned by the eigenvector with the larger eigenvalue.
- The general solution is to choose the eigenvectors of the covariance matrix with the $M$ largest eigenvalues,

$$ S u_i = \lambda_i u_i, \quad i = 1, \ldots, M. $$

The distortion measure then becomes the sum of the discarded eigenvalues,

$$ J = \sum_{i=M+1}^{D} \lambda_i. $$
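The identity $J = \sum_{i=M+1}^{D} \lambda_i$ can be verified numerically. A minimal sketch on illustrative random data:

```python
import numpy as np

rng = np.random.default_rng(1)
N, D, M = 400, 6, 2
X = rng.normal(size=(N, D)) @ rng.normal(size=(D, D))  # correlated illustrative data

xbar = X.mean(axis=0)
S = (X - xbar).T @ (X - xbar) / N

# Eigenpairs of S, sorted by descending eigenvalue.
eigvals, U = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]
eigvals, U = eigvals[order], U[:, order]

# Optimal approximation: keep the M leading eigenvector coordinates,
# replace the discarded coordinates by those of the mean.
Um = U[:, :M]
Xtilde = xbar + (X - xbar) @ Um @ Um.T

# Distortion J = (1/N) sum_n ||x_n - xtilde_n||^2.
J = np.mean(np.sum((X - Xtilde) ** 2, axis=1))
```

`J` matches `eigvals[M:].sum()` to numerical precision, as the derivation predicts.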
#### Applications of PCA to Face Recognition
#### (Recap) Geometrical Interpretation of PCA

- Principal components are the vectors in the directions of maximum variance of the projected data.
- [Figure: 2D data points with principal components $u_1$ and $u_2$]
- For dimensionality reduction, each 2D data point is transformed to a single variable $z_1$, representing the projection of the data point onto the eigenvector $u_1$. The data points projected onto $u_1$ have the maximum variance.
- PCA infers the inherent structure of high-dimensional data; the intrinsic dimensionality of the data is much smaller.
#### Eigenfaces

- Collect a set of face images.
- Normalise for scale, orientation and location (using eye locations), and vectorise them into the columns of a matrix

$$ X \in \mathbb{R}^{D \times N}, \quad D = wh, $$

where $w$ and $h$ are the image width and height and $N$ is the number of images.
- Construct the covariance matrix $S$ and obtain its eigenvectors $U$:

$$ S = \frac{1}{N} X X^T, \quad X = \left[\, \ldots,\ x_i - \bar{x},\ \ldots \,\right], $$

$$ S U = U \Lambda, \quad U \in \mathbb{R}^{D \times M}, $$

where $M$ is the number of eigenvectors kept.
- Project the data onto the subspace:

$$ Z = U^T X, \quad Z \in \mathbb{R}^{M \times N}, \quad M \ll D. $$

- Reconstruction is obtained as

$$ \tilde{x} = \sum_{i=1}^{M} z_i u_i = U z, \quad \tilde{X} = U Z. $$

- Use the distance to the subspace, $\| x - \tilde{x} \|$, for face recognition.
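The projection/reconstruction pipeline above can be sketched in NumPy, with random vectors standing in for vectorised face images (sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
D, N, M = 64, 30, 8            # D = w*h pixels, N images, M kept eigenvectors (illustrative)
X = rng.normal(size=(D, N))    # columns are vectorised "face" images

xbar = X.mean(axis=1, keepdims=True)
Xc = X - xbar                  # mean-subtracted data, X = [..., x_i - xbar, ...]
S = Xc @ Xc.T / N              # covariance S = (1/N) X X^T

# Eigenfaces: the M leading eigenvectors of S, U in R^{D x M}.
eigvals, U = np.linalg.eigh(S)
U = U[:, np.argsort(eigvals)[::-1]][:, :M]

Z = U.T @ Xc                   # projection Z = U^T X, Z in R^{M x N}
Xrec = xbar + U @ Z            # reconstruction X~ = U Z (adding the mean back)

# Distance to the subspace, ||x - x~||, per image, usable for recognition.
dist = np.linalg.norm(X - Xrec, axis=0)
```

The residual `X - Xrec` is orthogonal to the span of `U`, which is what "distance to the subspace" means.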
#### Eigenfaces: Method 1

- Given face images of different classes (i.e. identities) $c_i$, compute the principal (eigen-)subspace per class.
- A query (test) image $x$ is projected onto each eigen-subspace and its reconstruction error is measured.
- The class with the minimum error is assigned:

$$ \text{assign } c^{*} = \arg\min_{c} \| x - \tilde{x}_c \|, $$

where $\tilde{x}_c$ is the reconstruction of $x$ by the $c$-th class subspace.
#### Eigenfaces: Method 2

- Given face images of different classes (i.e. identities) $c_i$, compute the principal (eigen-)subspace over all the data.
- A query (test) image $x$ is projected onto the eigen-subspace and its projection $z$ is compared with the projections of the class means.
- The class with the minimum distance is assigned:

$$ \text{assign } c^{*} = \arg\min_{c} \| z - \bar{z}_c \|, $$

where $\bar{z}_c$ is the projection of the $c$-th class data mean.
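Both methods can be sketched together. The class data below are synthetic stand-ins for face images of different identities, and the subspace dimension `M` is an arbitrary illustrative choice:

```python
import numpy as np

def pca_basis(X, M):
    """Mean and M leading eigenvectors of the covariance of the columns of X."""
    mu = X.mean(axis=1, keepdims=True)
    Xc = X - mu
    vals, vecs = np.linalg.eigh(Xc @ Xc.T / X.shape[1])
    return mu, vecs[:, np.argsort(vals)[::-1][:M]]

rng = np.random.default_rng(3)
D, M = 20, 3
# Synthetic "identities": each class clusters around its own mean.
classes = {c: 5.0 * rng.normal(size=(D, 1)) + rng.normal(size=(D, 40)) for c in range(3)}
x = classes[1][:, :1] + 0.1 * rng.normal(size=(D, 1))  # query near a class-1 sample

def method1(x):
    """Per-class eigen-subspaces; assign by minimum reconstruction error."""
    errs = {}
    for c, Xcls in classes.items():
        mu, U = pca_basis(Xcls, M)
        x_rec = mu + U @ (U.T @ (x - mu))   # reconstruction by the c-th class subspace
        errs[c] = np.linalg.norm(x - x_rec)
    return min(errs, key=errs.get)

def method2(x):
    """One eigen-subspace over all data; assign by distance to projected class means."""
    mu, U = pca_basis(np.hstack(list(classes.values())), M)
    z = U.T @ (x - mu)
    dists = {c: np.linalg.norm(z - U.T @ (Xcls.mean(axis=1, keepdims=True) - mu))
             for c, Xcls in classes.items()}
    return min(dists, key=dists.get)
```

With well-separated class means, both methods assign the query to class 1; on real faces the trade-off is that Method 1 needs enough images per identity to build a per-class subspace, while Method 2 shares one subspace across identities.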
#### Matlab Demos: Face Recognition by PCA

- Face images
- Eigenvectors and eigenvalue plot
- Face image reconstruction
- Projection coefficients (visualisation of high-dimensional data)
- Face recognition
#### Probabilistic PCA (PPCA)

- In PCA, a subspace is spanned by the orthonormal basis (the eigenvectors computed from the covariance matrix).
- PPCA interprets each observation with a generative model.
- It estimates the probability of generating each observation with a Gaussian distribution.
- [Figure: PCA assumes a uniform prior on the subspace; PPCA places a Gaussian distribution on the subspace]
#### Continuous Latent Variable Model

- PPCA has a continuous latent variable.
- GMM (a mixture of Gaussians) is a model with a discrete latent variable (Lecture 3-4).
- PPCA models the original data points as lying close to a manifold of much lower dimensionality.
- In practice, the data points will not be confined precisely to a smooth low-dimensional manifold; we interpret the departures of the data points from the manifold as noise.
- Consider an example of digit images that undergo a random displacement and rotation. The images have 100 × 100 pixels, but the variability across images has only three degrees of freedom: vertical translation, horizontal translation and rotation.
- The data points live on a manifold whose intrinsic dimensionality is three.
- The translation and rotation parameters are continuous latent (hidden) variables; we only observe the image vectors.
#### Probabilistic PCA

- PPCA is an example of the linear-Gaussian framework, in which all marginal and conditional distributions are Gaussian (Lecture 15-16).
- We define a Gaussian prior distribution over the latent variable $z$ as

$$ p(z) = \mathcal{N}(z \mid 0, I). $$

The observed $D$-dimensional variable $x$ is defined as

$$ x = W z + \mu + \epsilon, $$

where $z$ is an $M$-dimensional Gaussian latent variable, $W$ is a $D \times M$ matrix, and $\epsilon$ is a $D$-dimensional zero-mean Gaussian-distributed noise variable with covariance $\sigma^2 I$.
- The conditional distribution takes the Gaussian form

$$ p(x \mid z) = \mathcal{N}\!\left( x \mid W z + \mu,\ \sigma^2 I \right). $$

This is a generative process mapping from latent space to data space, in contrast to the conventional view of PCA.
- The marginal distribution is written in the form

$$ p(x) = \int p(x \mid z)\, p(z)\, dz. $$

From the linear-Gaussian model, the marginal distribution is again Gaussian,

$$ p(x) = \mathcal{N}(x \mid \mu, C), \quad \text{where} \quad C = W W^T + \sigma^2 I. $$
The above can be seen from

$$ \mathbb{E}[x] = \mathbb{E}[W z + \mu + \epsilon] = \mu, $$

$$ \operatorname{cov}[x] = \mathbb{E}\!\left[ (W z + \epsilon)(W z + \epsilon)^T \right] = W\, \mathbb{E}[z z^T]\, W^T + \mathbb{E}[\epsilon \epsilon^T] = W W^T + \sigma^2 I, $$

using $\mathbb{E}[z z^T] = I$, $\mathbb{E}[\epsilon \epsilon^T] = \sigma^2 I$ and the independence of $z$ and $\epsilon$.
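The marginal mean and covariance can be checked by sampling from the generative model; $W$, $\mu$ and $\sigma^2$ below are arbitrary illustrative values:

```python
import numpy as np

rng = np.random.default_rng(4)
D, M, sigma2 = 5, 2, 0.25
W = rng.normal(size=(D, M))
mu = rng.normal(size=D)

# Generative process: z ~ N(0, I), eps ~ N(0, sigma^2 I), x = W z + mu + eps.
N = 200_000
Z = rng.normal(size=(N, M))
eps = np.sqrt(sigma2) * rng.normal(size=(N, D))
X = Z @ W.T + mu + eps

# The sample mean approaches mu and the sample covariance approaches
# C = W W^T + sigma^2 I, as derived above.
C = W @ W.T + sigma2 * np.eye(D)
C_emp = np.cov(X, rowvar=False)
```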
#### Maximum Likelihood Estimation for PPCA

- We need to determine the parameters $\mu$, $W$ and $\sigma^2$ that maximise the log-likelihood.
- Given a data set $X = \{x_n\}$ of observed data points, PPCA can be expressed as a directed graphical model.
The log likelihood is

$$ \ln p(X \mid \mu, W, \sigma^2) = -\frac{N}{2} \left\{ D \ln(2\pi) + \ln |C| + \operatorname{Tr}\!\left( C^{-1} S \right) \right\}. $$

Its maximiser with respect to $W$ is

$$ W_{ML} = U_M \left( L_M - \sigma^2 I \right)^{1/2} R, $$

where $U_M$ is the $D \times M$ eigenvector matrix of $S$, $L_M$ is the $M \times M$ diagonal eigenvalue matrix, and $R$ is an orthogonal rotation matrix s.t. $R R^T = I$. For the detailed optimisation, see Tipping and Bishop, "Probabilistic Principal Component Analysis" (1999).
The maximum-likelihood solution is redundant up to rotations $R$ of the latent-space coordinates. Consider the matrix

$$ \widetilde{W} = W R, $$

where $R$ is an orthogonal rotation matrix s.t. $R R^T = I$. We see

$$ \widetilde{W} \widetilde{W}^T = W R R^T W^T = W W^T, $$

hence $C = W W^T + \sigma^2 I$, and with it the likelihood, is independent of $R$.
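A quick numerical check of this rotation invariance; $W$, $\sigma^2$ and $R$ are arbitrary illustrative choices, with $R$ taken as the orthogonal Q-factor of a QR decomposition:

```python
import numpy as np

rng = np.random.default_rng(5)
D, M, sigma2 = 6, 3, 0.1
W = rng.normal(size=(D, M))

# An arbitrary orthogonal matrix R (Q-factor of a random matrix), R R^T = I.
R, _ = np.linalg.qr(rng.normal(size=(M, M)))
W_tilde = W @ R

# W~ W~^T = W R R^T W^T = W W^T, so C is unchanged.
C = W @ W.T + sigma2 * np.eye(D)
C_tilde = W_tilde @ W_tilde.T + sigma2 * np.eye(D)
```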
- Conventional PCA is generally formulated as a projection of points from the $D$-dimensional data space onto an $M$-dimensional linear subspace.
- PPCA is most naturally expressed as a mapping from the latent space to the data space.
- We can reverse this mapping using Bayes' theorem to obtain the posterior distribution $p(z \mid x)$ as

$$ p(z \mid x) = \mathcal{N}\!\left( z \mid M^{-1} W^T (x - \mu),\ \sigma^2 M^{-1} \right), $$

where the $M \times M$ matrix $M$ is defined by

$$ M = W^T W + \sigma^2 I. $$
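The posterior can be sketched as follows. To avoid clashing with the subspace dimension, the matrix $M$ is named `Mmat`, and $W$, $\mu$, $\sigma^2$ are illustrative values:

```python
import numpy as np

rng = np.random.default_rng(6)
D, Mdim, sigma2 = 5, 2, 0.2
W = rng.normal(size=(D, Mdim))
mu = rng.normal(size=D)

# The M x M matrix M = W^T W + sigma^2 I (named Mmat here).
Mmat = W.T @ W + sigma2 * np.eye(Mdim)

def posterior(x):
    """Mean and covariance of p(z | x) = N(z | M^{-1} W^T (x - mu), sigma^2 M^{-1})."""
    mean = np.linalg.solve(Mmat, W.T @ (x - mu))
    cov = sigma2 * np.linalg.inv(Mmat)
    return mean, cov

# For a noise-free point x = W z + mu, the posterior mean is a shrunken
# estimate of z: Mmat @ mean equals (W^T W) z exactly.
z = rng.normal(size=Mdim)
mean, cov = posterior(W @ z + mu)
```

The shrinkage toward zero reflects the Gaussian prior on $z$; as $\sigma^2 \to 0$ the posterior mean approaches the PCA-style projection.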
#### Limitations of PCA
**Unsupervised learning.** PCA finds the direction of maximum data variance (unsupervised), while LDA (Linear Discriminant Analysis) finds the direction that optimally separates data of different classes (supervised).

[Figure: PCA vs LDA projection directions]
**Linear model.** PCA is a linear projection method. When data lie on a nonlinear manifold, PCA is extended to Kernel PCA by the kernel trick (Lecture 9-10).

[Figure: PCA on a linear manifold (= subspace) vs Kernel PCA on a nonlinear manifold, via a feature mapping $\phi(\cdot)$]
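Kernel PCA can be sketched directly: form the Gram matrix, centre it in feature space, and take its leading eigenvectors. The RBF kernel and the concentric-rings data below are illustrative choices, not from the lecture:

```python
import numpy as np

def kernel_pca(X, M, gamma=1.0):
    """Project the rows of X onto M kernel principal components (RBF kernel).

    A minimal sketch: training-set projections only, no out-of-sample extension.
    """
    # Gram matrix K_ij = exp(-gamma ||x_i - x_j||^2).
    sq = np.sum(X ** 2, axis=1)
    K = np.exp(-gamma * (sq[:, None] + sq[None, :] - 2.0 * X @ X.T))

    # Centre in feature space: K~ = K - 1_N K - K 1_N + 1_N K 1_N.
    N = len(K)
    one = np.full((N, N), 1.0 / N)
    Kc = K - one @ K - K @ one + one @ K @ one

    # Leading eigenvectors of the centred Gram matrix give the projections.
    vals, vecs = np.linalg.eigh(Kc)
    order = np.argsort(vals)[::-1][:M]
    return vecs[:, order] * np.sqrt(np.maximum(vals[order], 0.0))

# Data on a nonlinear manifold: two noisy concentric rings.
rng = np.random.default_rng(7)
t = rng.uniform(0.0, 2.0 * np.pi, 200)
r = np.repeat([1.0, 3.0], 100)
X = np.c_[r * np.cos(t), r * np.sin(t)] + 0.05 * rng.normal(size=(200, 2))
Z = kernel_pca(X, M=2, gamma=0.5)
```

No linear projection of the raw 2D coordinates separates the two rings, whereas the kernel components operate in the implicit feature space induced by the RBF kernel.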
**Gaussian assumption.** PCA models data with Gaussian distributions (second-order statistics), whereas ICA (Independent Component Analysis) captures higher-order statistics.

[Figure: PCA vs ICA; principal components PC1, PC2 vs independent components IC1, IC2]
**Holistic bases.** PCA bases are holistic (cf. part-based) and less intuitive. ICA or NMF (Non-negative Matrix Factorisation) yields bases that capture local facial components.

Daniel D. Lee and H. Sebastian Seung (1999). "Learning the parts of objects by non-negative matrix factorization". Nature 401(6755): 788-791.
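Lee and Seung's multiplicative update rules, cited above, can be sketched as follows; the nonnegative data matrix here is random, standing in for vectorised face images:

```python
import numpy as np

rng = np.random.default_rng(8)
D, N, K = 30, 50, 5                     # pixels, images, basis components (illustrative)
V = rng.uniform(0.0, 1.0, size=(D, N))  # nonnegative data matrix, columns = images

# NMF: V ~ W H with W >= 0, H >= 0, minimising ||V - W H||_F^2
# via the multiplicative updates of Lee & Seung.
W = rng.uniform(0.1, 1.0, size=(D, K))
H = rng.uniform(0.1, 1.0, size=(K, N))
eps = 1e-9                              # guard against division by zero
err = [np.linalg.norm(V - W @ H)]
for _ in range(200):
    H *= (W.T @ V) / (W.T @ W @ H + eps)
    W *= (V @ H.T) / (W @ H @ H.T + eps)
    err.append(np.linalg.norm(V - W @ H))
```

The objective is non-increasing under these updates and `W`, `H` stay nonnegative by construction; on face data, the cited paper reports that the columns of `W` tend toward part-based components.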