Grauman - Frontiers in Computer Vision

Report
Capturing Human Insight for
Visual Learning
Kristen Grauman
Department of Computer Science
University of Texas at Austin
Frontiers in Computer Vision Workshop, MIT
August 22, 2011
Work with Sudheendra Vijayanarasimhan, Adriana
Kovashka, Devi Parikh, Prateek Jain, Sung Ju Hwang,
and Jeff Donahue
Problem: how to capture human insight about the visual world?
• Point+label “mold” is restrictive
• Human effort is expensive
[Figure: an annotator facing the complex space of visual objects, activities, and scenes; tiny image montage by Torralba et al.]
Problem: how to capture human insight about the visual world?
Our approach:
• Ask: actively learn
• Listen: explanations, comparisons, implied cues, …
[Figure: the annotator and the complex space of visual objects, activities, and scenes; tiny image montage by Torralba et al.]
Deepening human communication to the system
• What is this? How do you know?
• What’s worth mentioning?
• Which is more ‘open’?
• Do you find him attractive? Why?
• Is it ‘furry’?
• What property is changing here?
[Donahue & Grauman ICCV 2011; Hwang & Grauman BMVC 2010; Parikh & Grauman ICCV 2011, CVPR 2011; Kovashka et al. ICCV 2011]
Soliciting rationales
• We propose to ask the annotator not just what, but also why.
• Is the team winning? Is it a safe route? Is her form perfect? How can you tell?
• The annotator answers with a spatial rationale: a mark on the image region that supports the label.
Soliciting rationales
Annotation task: Is her form perfect? How can you tell?
• Spatial rationale: the annotator marks the image region that supports the answer.
• Attribute rationale: the annotator names the supporting attributes, e.g. “pointed toes”, “balanced” vs. “falling”, “knee angled”.
• Each rationale yields a synthetic contrast example: a copy of the original with the rationale features removed. Requiring the classifier to score the original above its contrast example increases the rationale’s influence on the classifier.
[Zaidan et al. HLT 2007]
[Donahue & Grauman, ICCV 2011]
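The contrast-example mechanism can be sketched in a few lines. This is a minimal toy illustration in the spirit of Zaidan et al.'s rationale constraints, not the paper's actual pipeline: the data, the choice of features 0 to 2 as the "rationale" features, and the down-weighting of the pseudo-examples are all hypothetical.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

# Toy data: 40 examples, 10 features; features 0-2 actually drive the label.
X = rng.normal(size=(40, 10))
y = (X[:, :3].sum(axis=1) > 0).astype(int)

# Rationale mask: the annotator says features 0-2 supported their label.
mask = np.zeros(10)
mask[:3] = 1.0

# Contrast examples: copies of the originals with rationale features removed.
X_contrast = X * (1 - mask)

# In the spirit of Zaidan et al., require the classifier to separate each
# original from its contrast by a margin; implemented by adding the scaled
# differences (X - X_contrast) / mu as extra, down-weighted training examples
# that carry the original labels.
mu = 0.5
X_diff = (X - X_contrast) / mu
X_aug = np.vstack([X, X_diff])
y_aug = np.concatenate([y, y])
weights = np.concatenate([np.ones(len(X)), 0.5 * np.ones(len(X))])

clf = LinearSVC(C=1.0).fit(X_aug, y_aug, sample_weight=weights)
print("train accuracy:", clf.score(X, y))
```

The extra examples pull the weight vector toward the rationale features, which is the intended effect of the contrast constraints.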
Rationale results
• Scene Categories: How can you tell the scene category?
• Hot or Not: What makes them hot (or not)?
• Public Figures: What attributes make them (un)attractive?
Collect rationales from hundreds of MTurk workers.
[Donahue & Grauman, ICCV 2011]
Rationale results

Scenes (mean AP)    Originals   +Rationales
Kitchen             0.1196      0.1395
Living Rm           0.1142      0.1238
Inside City         0.1299      0.1487
Coast               0.4243      0.4513
Highway             0.2240      0.2379
Bedroom             0.3011      0.3167
Street              0.0778      0.0790
Country             0.0926      0.0950
Mountain            0.1154      0.1158
Office              0.1051      0.1052
Tall Building       0.0688      0.0689
Store               0.0866      0.0867
Forest              0.3956      0.4006

Hot or Not (accuracy)   Originals   +Rationales
Male                    54.86%      60.01%
Female                  55.99%      57.07%

PubFig (accuracy)       Originals   +Rationales
Male                    64.60%      68.14%
Female                  51.74%      55.65%

[Donahue & Grauman, ICCV 2011]
Learning what to mention
• Issue: presence of objects != significance
• Our idea: learn a cross-modal representation that accounts for “what to mention”
Training: human-given descriptions with tags (e.g. Birds, Architecture, Water, Cow, Sky, Tiles)
• Visual features: texture, scene, color, …
• Textual features: word frequency, relative order, mutual proximity
Learning what to mention
The visual view (x) and the textual view (y) are projected into a common importance-aware semantic space.
[Hwang & Grauman, BMVC 2010]
Learning what to mention: results
[Figure: image retrieval for query images, comparing visual-only and words+visual baselines against our method.]
[Hwang & Grauman, BMVC 2010]
Problem: how to capture human insight about the visual world?
Our approach:
• Ask: actively learn
• Listen: explanations, comparisons, implied cues
[Figure: the annotator and the complex space of visual objects, activities, and scenes; tiny image montage by Torralba et al.]
Traditional active learning
At each cycle, obtain a label for the most informative or uncertain example. [Mackay 1992; Freund et al. 1997; Tong & Koller 2001; Lindenbaum et al. 2004; Kapoor et al. 2007; …]
[Figure: loop between the current model, unlabeled data, active selection, the annotator, and labeled data.]
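The cycle described above is classic uncertainty sampling. A minimal sketch with a logistic-regression learner and a synthetic labeling oracle (the pool, oracle rule, and budget of 20 queries are all hypothetical):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy pool: 500 unlabeled points whose true label is a linear rule (the oracle).
X_pool = rng.normal(size=(500, 2))
y_pool = (X_pool[:, 0] + X_pool[:, 1] > 0).astype(int)

# Seed with one example per class, then query the most uncertain point
# (predicted probability closest to 0.5) for 20 cycles.
labeled = [int(np.flatnonzero(y_pool == c)[0]) for c in (0, 1)]
for _ in range(20):
    clf = LogisticRegression().fit(X_pool[labeled], y_pool[labeled])
    probs = clf.predict_proba(X_pool)[:, 1]
    uncertainty = -np.abs(probs - 0.5)      # closest to 0.5 = most uncertain
    uncertainty[labeled] = -np.inf          # never re-query a labeled point
    labeled.append(int(np.argmax(uncertainty)))

print("labels obtained:", len(labeled))
print("pool accuracy:", clf.score(X_pool, y_pool))
```

Note the naive `argmax` scans the whole pool each cycle, which is exactly the cost the hashing approach below the fold is designed to avoid.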
Challenges in active visual learning
• Annotation tasks vary in cost and information
• Multiple annotators working in parallel
• Massive unlabeled pools of data
[Figure: the active learning loop, with annotation costs ($) varying across tasks.]
[Vijayanarasimhan & Grauman NIPS 2008, CVPR 2009; Vijayanarasimhan et al. CVPR 2010, CVPR 2011; Kovashka et al. ICCV 2011]
Sub-linear time active selection
We propose a novel hashing approach to identify the most uncertain examples in sub-linear time: the current classifier is used as a query into a hash table over the unlabeled data, retrieving the actively selected examples directly.
For 4.5 million unlabeled instances: 10 minutes of machine time per iteration, vs. 60 hours for a naïve scan.
[Jain, Vijayanarasimhan, Grauman, NIPS 2010]
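The hyperplane-hashing idea can be illustrated with the H-Hash family from Jain et al.: each hash pairs two random projections, database points hash as [sign(u·x), sign(v·x)], and a hyperplane query hashes as [sign(u·w), sign(-v·w)], so both bits of a pair agree with probability (1 - θ/π)(θ/π), which peaks when x is orthogonal to w, i.e. near the decision boundary. This sketch counts collisions with a linear scan for clarity; the actual method stores the codes in hash tables to get sub-linear lookup. All data is synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, n_pairs = 10, 5000, 64

# Unlabeled pool on the unit sphere, plus the current classifier's normal w.
X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
w = rng.normal(size=d)
w /= np.linalg.norm(w)

# H-Hash family: each hash function is a pair of random projections (u, v).
U = rng.normal(size=(n_pairs, d))
V = rng.normal(size=(n_pairs, d))

# Database codes: [sign(u.x), sign(v.x)].  Query code: [sign(u.w), sign(-v.w)].
db1, db2 = np.sign(X @ U.T), np.sign(X @ V.T)
q1, q2 = np.sign(U @ w), np.sign(-(V @ w))

# Count per-point pair collisions (both bits agree); points near the
# hyperplane collide most often, so they are the "retrieved" examples.
collisions = ((db1 == q1) & (db2 == q2)).sum(axis=1)
retrieved = np.argsort(-collisions)[:100]

print("mean |w.x| retrieved:", np.abs(X[retrieved] @ w).mean())
print("mean |w.x| overall:  ", np.abs(X @ w).mean())
```

The retrieved set concentrates on points with small |w·x|, which are exactly the uncertain examples a margin-based active learner wants labeled next.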
Live active learning results
Outperforms the status quo data collection approach on a Flickr test set.
[Vijayanarasimhan & Grauman, CVPR 2011]
Summary
• Humans are not simply “label machines”
• Widen access to visual knowledge
– New forms of input, often requiring associated new
learning algorithms
• Manage large-scale annotation efficiently
– Cost-sensitive active question asking
• Live learning: moving beyond canned datasets
