Describable Visual Attributes for Face Verification

Neeraj Kumar, Alexander C. Berg, Peter N. Belhumeur, and
Shree K. Nayar
Presented by Gregory Teodoro
Attribute Classification
◦ Early research focused on gender and ethnicity.
 Done on small datasets
 Use of linear discriminant analysis for simple attributes
such as glasses.
 Methods used to characterize or separate two or more
classes of objects or events through their differences.
 “Face-print” training was used with Support Vector
Machines to determine gender.
 Use of simple pixel-comparison operators.
Why use Attribute Classification
◦ Faces have a well-established and consistent
reference frame for image alignment
◦ Differentiating like objects is conceptually simple
 In Paper Example : Comparing two cars of the same
model may or may not be considered the same object;
two images of the same face, however, always show the
same object.
◦ A shared pool of attributes applicable to all faces.
 Gender, Race, Hair Color, Moustache, Eyewear, Curly,
Bangs, Eyebrow Bushiness, and so on…
Older methods used a Euclidean distance between pairs of
images using Principal Component Analysis, later adding
linear discriminant analysis.
◦ Algorithms worked well, but only in controlled environments
 Pose, angle, lighting, and expression caused issues in recognizing the
same face.
◦ Does not work very well on the “Labeled Faces in the Wild” (LFW)
benchmark.
Other methods used 2D alignment strategies, and applied
them to the LFW benchmark set, aligning all faces to each
other or pairs considered to be similar.
◦ This was computationally expensive.
◦ The paper attempts to find a better solution: an algorithm
that does not involve matching points.
Paper suggests a new method, using attribute and identity
labels to describe an image
Images were collected off the internet through a
large number of photo-sharing sites, search
engines, and Amazon Mechanical Turk (MTurk).
Downloaded images are run through the OKAO
Face Detector, which extracts faces, pose angles,
and locations of points of interest.
◦ The two corners of each eye and the mouth corners.
 These points are used to align the faces for further
processing.
The end result is the largest collection of “real-world”
faces: faces collected in non-controlled settings.
◦ The Columbia Face Database
Images were labeled using the Amazon Mechanical
Turk (MTurk) service.
◦ A form of crowd-sourcing, each image is labeled
manually by a group of three people; only labels
where all three people agreed were used.
 Total collection of 145,000 verified positive labels.
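The unanimous-agreement rule above can be sketched in a few lines. This is an illustrative stand-in, not the paper's pipeline; the function name, data layout, and votes are invented for the example.

```python
# Sketch of the label filter described above: an (image, attribute) label
# is kept only when all three MTurk annotators gave the same answer.
# Names and data are illustrative, not from the paper's actual pipeline.

def unanimous_labels(annotations):
    """annotations: dict mapping (image_id, attribute) -> list of 3 votes."""
    verified = {}
    for key, votes in annotations.items():
        if len(votes) == 3 and len(set(votes)) == 1:
            verified[key] = votes[0]  # all annotators agree; keep the label
    return verified

votes = {
    ("img_001", "smiling"): [True, True, True],     # unanimous -> kept
    ("img_001", "glasses"): [True, False, True],    # disagreement -> dropped
    ("img_002", "male"):    [False, False, False],  # unanimous -> kept
}
verified = unanimous_labels(votes)
```

Requiring full agreement trades label quantity for label quality, which matters when classifiers are later trained on these labels as ground truth.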
Content-Based Image Retrieval System
◦ Difference in goal from most CBIR systems
 Most try to find objects similar to another object
 This system tries to find an object fitting a text query.
 In Paper Example : “Asian Man Smiling With Glasses”
Attributes collected by this method are not binary.
◦ Thickness of eyebrows is not a “have” or “have not”
situation, but rather a continuous attribute: “how thick?”
Visual attributes are far more varied than named,
specific attributes, providing more possible
descriptions overall.
◦ Black, Asian, Male, and Female are specific named attributes;
eyebrow bushiness, skin shine, and age are visual attributes.
FaceTracer is the subset of the Columbia Face
Database, containing these attribute labels. There
are 5,000 labels.
PubFig is the second dataset, consisting of 58,797 images
of 200 individuals in a variety of poses, expressions, and
lighting conditions.
A set of sample images and their attributes.
Attributes are thought of as functions a[i],
mapping an image I to a real value a[i](I).
◦ Positive values indicate strength of the ith attribute,
and negative values indicate absence.
A second form of attribute is called a “simile”
◦ Example : A person has “eyes like Penelope Cruz’s”.
 Forms a simile function S[cruz][eyes]
Learning an attribute or simile classifier is as
simple as fitting a function to a set of
prelabeled training data.
◦ The data must then be regularized, with bias toward
more commonly observed features.
Faces are aligned and transformed using an
affine transformation
◦ Easy to do thanks to eyes, mouth, etc.
The face is then split into 10 regions,
corresponding to feature areas, such as nose,
mouth, eyebrows, forehead, and so on.
◦ Regions are defined manually, but only once.
◦ Dividing the face this way takes advantage of the
common geometry of human faces, while still allowing
for differences.
 Robust to small errors in alignment.
◦ Extracted values are normalized to lower the effect of
lighting and generalize the images.
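The affine alignment step above can be sketched as a least-squares fit mapping the detected fiducial points (eye and mouth corners) onto canonical template positions. The coordinates and function name below are illustrative assumptions, not values from the paper.

```python
import numpy as np

# Hedged sketch of affine alignment: estimate a 2x3 affine transform that
# maps detected fiducial points onto canonical template positions, in the
# spirit of the alignment step described above. Coordinates are made up.

def fit_affine(src, dst):
    """Least-squares affine transform A (2x3) with dst ~= A @ [src; 1]."""
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    ones = np.ones((len(src), 1))
    X = np.hstack([src, ones])                    # (n, 3) homogeneous points
    W, *_ = np.linalg.lstsq(X, dst, rcond=None)   # (3, 2) solution
    return W.T                                    # (2, 3) affine matrix

# Detected eye/mouth corners (pixels) and canonical positions (illustrative):
detected  = [(80, 120), (140, 118), (95, 200), (130, 201)]
canonical = [(70, 110), (130, 110), (85, 190), (115, 190)]
A = fit_affine(detected, canonical)
warped = A @ np.array([80, 120, 1.0])  # first point mapped toward canonical
```

Because only six fiducial points constrain six affine parameters, small detection errors are averaged out by the least-squares fit rather than propagated exactly.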
A sample face discovered and split into regions of interest.
A sample simile comparison, and more region details.
The best features for classification are chosen automatically
from a number of candidate features.
◦ These are used to train the final attribute and simile
classifiers.
Classifiers (C[i]) are built using a supervised learning
approach.
◦ Trained against a set of labeled images for each attribute,
both positive and negative.
◦ This is iterated throughout the dataset and across the
different attributes.
◦ Classifiers are chosen based on cross-validation accuracy
 Features are continually added until test accuracy stops
improving.
 For performance, the lowest-scoring 70% of classification
features are dropped, down to a minimum of 10 features.
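The "keep adding features until cross-validation accuracy stops improving" loop above can be sketched as a greedy forward selection. The classifier and scoring function here are toy stand-ins, not the paper's actual features or accuracies.

```python
# Minimal sketch of greedy, cross-validation-driven feature selection:
# repeatedly add the single best remaining feature until held-out accuracy
# stops improving. score_fn is a stand-in for real cross-validation.

def greedy_select(features, score_fn):
    """features: list of feature names; score_fn(subset) -> CV accuracy."""
    chosen, best = [], 0.0
    while True:
        candidates = [f for f in features if f not in chosen]
        if not candidates:
            break
        # score every one-feature extension of the current subset
        scored = [(score_fn(chosen + [f]), f) for f in candidates]
        acc, f = max(scored)
        if acc <= best:            # accuracy stopped improving -> stop
            break
        chosen.append(f)
        best = acc
    return chosen, best

# Toy accuracy model: each feature adds a fixed gain; "noise" adds nothing.
gains = {"eyes": 0.30, "mouth": 0.25, "hair": 0.10, "noise": 0.0}
score = lambda subset: 0.5 + sum(gains[f] * 0.5 for f in subset)
chosen, acc = greedy_select(list(gains), score)
```

In the toy run the useless "noise" feature is never added, mirroring how the real procedure excludes features that do not raise cross-validation accuracy.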
Results of Gender and Smiling
Detection. (Above.)
Results of Classifiers and their
cross-validation values. (Right.)
Are these two faces of the same person?
◦ Small changes in pose, expression, and lighting can
cause false negatives.
Do two images I[1] and I[2] show the same person?
◦ A verification classifier V compares the attributes C[I[1]]
and C[I[2]], returning v(I[1],I[2])
 These vectors are the result of concatenating the outputs of
the n attribute and simile classifiers.
Assumptions made
◦ C[i] for I[1] and I[2] should be similar if they show the
same person, and different otherwise.
◦ Classifier values are raw outputs of binary classifiers,
so the sign of the returned value is important.
Sample of face verification.
Let a[i] and b[i] be the outputs of the ith trait
classifier for each face.
◦ We want a value that is large in magnitude, with a sign
that depends on whether the two faces show the same
person.
The absolute value of a[i] – b[i] gives us the
similarity of the two outputs, and the product a[i]b[i]
gives us the sign.
◦ Thus each attribute contributes the pair (|a[i] – b[i]|, a[i]b[i]).
The concatenation of this for all n
attributes/similes forms the input to the
verification classifier V.
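The per-attribute construction described above can be sketched directly: for each trait, take the absolute difference (similarity) and the product (sign), and concatenate over all attributes. The scores below are illustrative values, not real classifier outputs.

```python
# Sketch of the verification features implied above: for classifier outputs
# a_i and b_i on the two faces, |a_i - b_i| measures similarity and
# a_i * b_i carries the sign; pairs are concatenated over all n attributes
# to form the input to V. The numeric scores are illustrative.

def verification_features(a, b):
    feats = []
    for ai, bi in zip(a, b):
        feats.extend([abs(ai - bi), ai * bi])  # (similarity, sign) per trait
    return feats

# Two faces with similar, confidently positive scores on both traits:
same = verification_features([1.2, 0.8], [1.1, 0.9])
# A mismatched pair: the classifiers disagree in sign on the first trait.
diff = verification_features([1.2, 0.8], [-1.0, 0.9])
```

For a matching pair the differences are small and the products positive; for a mismatched pair some product turns negative, which is exactly the signal V is trained to pick up.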
Training V requires a combination of positive
and negative examples.
◦ The classification function was trained using a
support vector machine.
Accuracy hovers around 85% on average,
slightly below but comparable to the current
state-of-the-art method (86.83%).
When human-based verification is compared to
machine-based verification, human-based wins out
by a large margin.
◦ The algorithm, when tested against the LFW, had an
accuracy of 78.65%, compared to average human
accuracy, which is 99.20%
 Testing was done by pulling a subset of 20,000 images
of 140 people from the LFW, and creating mutually
disjoint sets of 14 individuals.
A completely new direction for face verification,
with performance already comparable to
state-of-the-art algorithms.
Further improvements can be made by using
◦ more attributes
◦ improving the training process
◦ combining attribute and simile classifiers with low-level
image cues.
Questions remaining on how to apply
attributes to domains other than faces. (Cars,
houses, animals, etc.)
