Describable Visual Attributes for Face Verification and Image Search

Report
Neeraj Kumar, Alexander C. Berg, Peter N. Belhumeur, and
Shree K. Nayar
Presented by Gregory Teodoro

Attribute Classification
◦ Early research focused on gender and ethnicity.
 Done on small datasets.
 Linear discriminant analysis was used for simple attributes
such as glasses.
 These methods characterize or separate two or more
classes of objects or events through their differences.
 "Face-print" training was used with Support Vector
Machines to determine gender.
 Simple pixel-comparison operators were also used.

Why Use Attribute Classification?
◦ Faces have a well-established and consistent
reference frame for image alignment.
◦ Differentiating like objects is conceptually simple.
 In-paper example: two cars of the same model may or
may not be considered the same object; two identical
faces, however, are always the same person.
◦ A shared pool of attributes applies to all faces.
 Gender, race, hair color, moustache, eyewear, curly hair,
bangs, eyebrow bushiness, and so on…

Older methods used a Euclidean distance between pairs of
images with Principal Component Analysis, later adding linear
discriminant analysis.
◦ These algorithms worked well, but only in controlled environments.
 Changes in pose, angle, lighting, and expression caused failures in
recognizing the face.
◦ They do not perform well on the "Labeled Faces in the Wild" (LFW)
benchmark.
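The older PCA-based pipeline above can be sketched in a few lines. This is an illustrative toy reconstruction (project faces onto principal components, then compare with Euclidean distance), not the exact method of any cited work; all data and dimensions are made up.

```python
import numpy as np

def fit_pca(faces, k):
    """faces: (n, d) matrix of flattened face images."""
    mean = faces.mean(axis=0)
    centered = faces - mean
    # Right singular vectors of the centered data give the top-k axes.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[:k]

def project(face, mean, components):
    """Project one flattened face onto the k principal components."""
    return components @ (face - mean)

rng = np.random.default_rng(0)
faces = rng.normal(size=(50, 64))   # 50 toy "faces", 64 pixels each
mean, comps = fit_pca(faces, k=8)

a = project(faces[0], mean, comps)
b = project(faces[1], mean, comps)
distance = np.linalg.norm(a - b)    # small distance => likely same person
```

Because the projection ignores pose and lighting, two photos of the same person taken under different conditions can land far apart, which is exactly the failure mode noted above.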

Other methods used 2D alignment strategies, applying
them to the LFW benchmark set and aligning all faces to each
other or to pairs considered similar.
◦ This was computationally expensive.
◦ The paper attempts to find a better algorithm that
does not involve matching points.

The paper proposes a new method, using attribute and identity
labels to describe an image.


Images were collected off the internet from a
large number of photo-sharing sites, search
engines, and Amazon Mechanical Turk (MTurk).
Downloaded images are run through the OKAO
face detector, which extracts faces, pose angles,
and the locations of points of interest.
◦ The two corners of each eye and the corners of the mouth.
 These points are used to align the face and in image
transformation.

The end result is the largest collection of "real-world"
faces; that is, faces collected in an uncontrolled
environment.
◦ The Columbia Face Database

Images were labeled using the Amazon Mechanical
Turk (MTurk) service.
◦ A form of crowd-sourcing: each image is labeled
manually by a group of three people, and only labels
on which all three people agreed were used.
 Total collection of 145,000 verified positive labels.
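The unanimous-vote rule above is simple to express in code. This is a minimal sketch; the data layout (a dict of per-image, per-attribute answer lists) is illustrative, not the paper's actual format.

```python
# Each image is labeled by three MTurk workers; a label is kept only
# when all three answers agree. The example data is made up.
votes = {
    "img_001": {"smiling": ["yes", "yes", "yes"],
                "glasses": ["yes", "no", "yes"]},
    "img_002": {"smiling": ["no", "no", "no"]},
}

def verified_labels(votes):
    """Keep only (image, attribute) labels with unanimous agreement."""
    kept = {}
    for image, attrs in votes.items():
        for attr, answers in attrs.items():
            if len(answers) == 3 and len(set(answers)) == 1:
                kept[(image, attr)] = answers[0]
    return kept

result = verified_labels(votes)
# "glasses" for img_001 is dropped because one worker disagreed.
```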

Content-Based Image Retrieval System
◦ The goal differs from that of most CBIR systems.
 Most try to find objects similar to another object.
 This system tries to find an object fitting a text query.
 In-paper example: "Asian Man Smiling With Glasses"


Attributes collected by this method are not binary.
◦ Thickness of eyebrows is not a "have" or "have not"
situation, but rather a continuous attribute: "how thick."
Visual attributes are far more varied than names and
specific attributes, providing more possible
descriptions overall.
◦ Black, Asian, male, and female are specific named attributes;
eyebrow bushiness, skin shine, and age are visual
attributes.


FaceTracer is the subset of the Columbia Face
Database containing these attribute labels. It
holds 5,000 labels.
PubFig is the second dataset: 58,797 images of
200 individuals in a variety of poses and
environments.

A set of sample images and their attributes.

Attributes are modeled as functions a[i],
mapping an image I to a real value a[i](I).
◦ Positive values indicate the presence and strength of the
ith attribute, and negative values indicate its absence.
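The sign/magnitude convention above can be illustrated with a tiny helper. This is a sketch of the convention only; the attribute names and scores below are made up.

```python
# Each attribute classifier outputs a real value a_i(I): the sign
# indicates presence vs. absence, the magnitude indicates confidence.
def describe(a_i):
    """Return (present, strength) for one attribute output."""
    return a_i > 0, abs(a_i)

attribute_scores = {"male": 1.8, "smiling": -0.3, "glasses": 0.05}

for name, a_i in attribute_scores.items():
    present, strength = describe(a_i)
    print(f"{name}: present={present}, strength={strength:.2f}")
```

Here "male" is strongly present, "smiling" is weakly absent, and "glasses" is present but with very low confidence.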

A second form of attribute is the "simile."
◦ Example: a person has "eyes like Penelope Cruz's."
 This forms a simile function S[cruz][eyes].

Learning an attribute or simile classifier is as
simple as fitting a function to a set of
prelabeled training data.
◦ The fit must then be regularized, with a bias towards
more commonly observed features.

Faces are aligned and transformed using an
affine transformation.
◦ Easy to do thanks to the detected eye and mouth points.
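The alignment step can be sketched as a least-squares fit of an affine matrix mapping the detected fiducial points onto fixed template positions. This is a hedged reconstruction under that assumption; all coordinates below are illustrative, not the paper's.

```python
import numpy as np

# Canonical template positions for six fiducial points
# (four eye corners, two mouth corners). Made-up coordinates.
canonical = np.array([[25., 35.], [45., 35.], [65., 35.],
                      [85., 35.], [40., 85.], [75., 85.]])

# Simulated detector output: the canonical points rotated, scaled,
# and translated, as if the face were tilted in the photo.
theta = np.deg2rad(10.0)
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
detected = 1.2 * canonical @ R.T + np.array([12., -7.])

# Solve [x y 1] @ M = canonical for the 3x2 affine matrix M
# in the least-squares sense.
A = np.hstack([detected, np.ones((len(detected), 1))])
M, *_ = np.linalg.lstsq(A, canonical, rcond=None)

aligned = A @ M  # detected points mapped back onto the template
```

The same matrix M would then be used to warp the whole face image, so every face ends up in the shared reference frame mentioned earlier.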

The face is then split into 10 regions
corresponding to feature areas, such as the nose,
mouth, eyebrows, forehead, and so on.
◦ Regions are defined manually, but only once.
◦ Dividing the face this way takes advantage of the
common geometry of human faces while still allowing
for differences.
 Robust to small errors in alignment.
◦ Extracted values are normalized to lower the effect of
lighting and generalize the images.
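The per-region normalization can be sketched as zero-mean, unit-variance scaling of each region's pixel values, which is one standard way to reduce lighting effects. This is an assumption about the exact normalization; the region boundaries below are also illustrative.

```python
import numpy as np

def normalize_region(pixels):
    """Normalize a pixel region to zero mean and unit variance."""
    pixels = pixels.astype(float)
    std = pixels.std()
    return (pixels - pixels.mean()) / (std if std > 0 else 1.0)

rng = np.random.default_rng(1)
face = rng.integers(0, 256, size=(100, 100))   # toy aligned face image

# Hypothetical region boundaries within the aligned face.
regions = {"forehead": face[0:25, 10:90],
           "mouth": face[70:90, 30:70]}
normalized = {name: normalize_region(r) for name, r in regions.items()}
```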

A sample face discovered and split into regions of interest.

A sample simile comparison, and more region details.

The best features for classification are chosen automatically
from a large pool of candidates.
◦ These are used to train the final attribute and simile
classifiers.

Classifiers C[i] are built using a supervised learning
approach.
◦ Each is trained against a set of positively and negatively
labeled images for its attribute.
◦ This is iterated throughout the dataset and across the
different classifiers.
◦ Features are chosen based on cross-validation accuracy.
 Features are continually added until test accuracy stops
improving.
 For performance, the lowest-scoring 70% of classification
features are dropped, down to a minimum of 10 features.
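The pruning rule above (drop the lowest-scoring 70%, but never go below 10 features) can be sketched directly. The feature names and cross-validation scores here are made up.

```python
def prune_features(scores, drop_frac=0.7, floor=10):
    """scores: dict of feature name -> cross-validation accuracy.
    Keep the top (1 - drop_frac) fraction, but at least `floor`."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    keep = max(floor, int(round(len(ranked) * (1 - drop_frac))))
    return ranked[:keep]

# 40 candidate features with synthetic scores: keeps the top 12 (30%).
scores = {f"feat_{i}": 0.5 + 0.01 * i for i in range(40)}
kept = prune_features(scores)
```

With only 20 candidates, 30% would be 6 features, so the floor of 10 kicks in instead.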

Results of gender and smiling detection (above).
Results of classifiers and their cross-validation values (right).

Are these two faces of the same person?
◦ Small changes in pose, expression, and lighting can
cause false negatives.

Goal: determine whether two images I[1] and I[2] show the same person.
◦ Verification classifier V compares the attribute outputs C(I[1])
and C(I[2]), returning v(I[1],I[2]).
 These vectors are the result of concatenating the outputs of all n
attribute classifiers.

Assumptions made:
◦ The outputs C[i] for I[1] and I[2] should be similar if they show
the same person, and different otherwise.
◦ Classifier values are the raw outputs of binary classifiers,
so the sign of the returned value is important.

Sample of face verification.

Let a[i] and b[i] be the outputs of the ith trait
classifier for each face.
◦ We want to create a value of large magnitude that is
positive or negative depending on whether the two faces
show the same individual.

The absolute value of a[i] − b[i] gives us the
similarity, and the product a[i]b[i]
gives us the sign.
◦ Thus, each attribute contributes the pair (|a[i] − b[i]|, a[i]·b[i]).


The concatenation of these values for all n
attributes/similes forms the input to the
verification classifier V.
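The pairwise feature construction described above can be sketched as follows; it assumes each attribute contributes both |a[i] − b[i]| and a[i]·b[i], as the slides state. The classifier outputs below are made up.

```python
import numpy as np

def pair_features(a, b):
    """Build the input v(I1, I2) to the verification classifier V:
    for each attribute, |a_i - b_i| (similarity) and a_i * b_i
    (sign agreement), concatenated over all n attributes."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    return np.concatenate([np.abs(a - b), a * b])

a = [1.2, -0.5, 0.8]    # attribute outputs for image I1 (synthetic)
b = [1.0, -0.7, -0.9]   # attribute outputs for image I2 (synthetic)
v = pair_features(a, b)  # length 2n = 6
```

Note how the third attribute disagrees in sign (0.8 vs. −0.9), producing a negative product term that pushes V toward "different person."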
Training V requires a combination of positive
and negative examples.
◦ The classification function was trained using
libsvm.
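Training V can be sketched with scikit-learn's SVC, which wraps libsvm (the paper used libsvm directly). The pair features and same/different labels below are synthetic stand-ins for real attribute outputs.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_attrs = 5

def pair_features(a, b):
    """Per-pair input to V: |a_i - b_i| and a_i * b_i, concatenated."""
    return np.concatenate([np.abs(a - b), a * b])

X, y = [], []
for _ in range(200):
    a = rng.normal(size=n_attrs)
    same = rng.random() < 0.5
    # Same-person pairs get similar attribute outputs;
    # different-person pairs get independent ones.
    b = a + 0.1 * rng.normal(size=n_attrs) if same \
        else rng.normal(size=n_attrs)
    X.append(pair_features(a, b))
    y.append(int(same))

V = SVC(kernel="rbf").fit(np.array(X), np.array(y))
acc = V.score(np.array(X), np.array(y))  # training accuracy
```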


Accuracy hovers around 85% on
average, slightly below but comparable to the
current state-of-the-art method (86.83%).
When human-based verification is compared with
machine-based verification, human-based wins out by a large margin.
◦ The algorithm, when tested against the LFW, had an
accuracy of 78.65%, compared to an average human
accuracy of 99.20%.
 Testing was done by pulling a subset of 20,000 images
of 140 people from the LFW, and creating mutually
disjoint sets of 14 individuals.


A completely new direction for face verification,
with performance already comparable to state-of-the-art
algorithms.
Further improvements can be made by
◦ using more attributes
◦ improving the training process
◦ combining attribute and simile classifiers with
low-level image cues.

Questions remain on how to apply
attributes to domains other than faces (cars,
houses, animals, etc.).
