Trevor Darrell UC Berkeley July Meeting Slides

Report
MURI Update 7/26/12
UC Berkeley
Prof. Trevor Darrell (UCB)
Prof. Michael Jordan (UCB)
Prof. Brian Kulis (fmr UCB postdoc,
now Asst. Prof. OSU)
Dr. Stefanie Jegelka (UCB postdoc)
Recent Effort
• NPB inference of visual structures
- Objects
- Trajectories
• Develop richer representations (appearance,
shape)
• Efficient implementation in constrained
environments
• Distributed and decentralized variants
NPB Trajectory Models
• Extend NPB models to
trajectory domains:
– find structure in motion trajectories
– identify anomalies
– Considering Bluegrass data and
possibly ARGUS track data via LLNL
collaboration
• HDP and hard clustering
variants
• Decentralized / distributed
implementation
topics
topics
New Algorithms for Hard Clustering
via Bayesian Nonparametrics
Make Bayesian, Take
Limit
Mixture of
Gaussians
DP
Mixture
k-means
??
Fix
Covariances,
Take Limit
Generalized to Hard HDP as well;
[Kulis and Jordan ICML2012]
• Kulis Presentation
Small Variance Asymptotics,
Bayesian Nonparametrics, and
k-means
Brian Kulis
Ohio State University
Joint work with Michael Jordan (Berkeley), Ke
Jiang (OSU), and Tamara Broderick (Berkeley)
Generalizing k-means
Make Bayesian, Take
Limit
Mixture of
Gaussians
DP Mixture
k-means
????
Fix
Covariances,
Take Limit
[Kulis and Jordan, ICML 2012]
• For
“k-means
people”
Why
Should
We Care?
• Bayesian techniques permit great
flexibility
• Extensions to multiple data sets
would not be obvious without this
connection
• For Bayesians
• Hard clustering methods scale
better
• Connections to graph cuts and
spectral methods
Gaussian Mixture Models
•
•
No closed-form
for optimizing
likelihood
Typically resort
to EM algorithm
EM Algorithm
E-Step
M-Step
k-means
Mixture of
Gaussians
Fix
Covariances,
Take Limit
k-means
In the limit as sigma goes to 0, this
value is 1 if centroid c is the closest, and
0 otherwise
Linear-Gaussian Latent
Feature Models
Latent Variables:
Data:
Likelihood:
For mixture of Gaussians:
Log-Likelihood Asymptotics
Other Linear-Gaussian
Models
Let Z have a continuous distribution
(Essentially) probabilistic PCA
As sigma goes to 0, get standard PCA
The Polya Urn
• Imagine an urn with theta balls for each
of k colors
• Pick a ball from the urn, replace the ball
and add another ball of that color to the
urn
• Repeat n times
• Induces a distribution over Z
The Dirichlet-Multinomial
Distribution
(If theta an integer, gamma functions are
factorials.)
Small-Variance Asymptotics
Nothing interesting happens unless:
In this case, we obtain:
The Chinese Restaurant
Process
• Customers sequentially enter the
1
2
3
5
6
...
4
restaurant
• First customer sits at first table
• Subsequent customers sit at occupied
table with probability proportional to
number of occupants
• Start a new table with probability
The Chinese Restaurant
Process
The exchangeable partition
probability function (EPPF):
Select theta as before,
yields asymptotically:
Algorithmic Perspective:
Gibbs Sampling
• Suppose we want to sample from p(x),
where
• Repeatedly sample as follows
Collapsed Gibbs Sampling
for CRP/DP Mixtures
• Want to sample from the posterior:
• Need the following to do Gibbs
sampling:
Asymptotics of the Gibbs
Sampler
• Now, would like to see what
DP Mixture
happens when sigma goes to 0
• Need 1 additional thing
• must be a function of sigma:
????
Asymptotics of the Gibbs
Sampler
Existing Clusters
In the limit:
Start a New Cluster
DP-means
Underlying Objective
Function
Theorem: The DP-means algorithm
monotonically minimizes this objective until
local convergence.
3 Gaussians
Clustering with Multiple Data
Sets
Want to cluster each data set, but also want
to share cluster structure
Use the Hierarchical Dirichlet Process (HDP)!
The Hard Gaussian HDP:
Objective
Theorem: The Hard HDP algorithm
monotonically minimizes this objective until
local convergence.
Exponential Families
• What happens when we replace the
Gaussian likelihood with an arbitrary
exponential family distribution?
• Do asymptotics where the mean is fixed
but the covariance goes to 0:
Exponential Families
• Use the standard EF conjugate prior
• Utilize Bregman divergences:
• Asymptotics of Gibbs require Laplace’s
method on the marginal likelihood
• End up with same result as before, with
Bregman divergence replacing sq.
Euclidean
Illustration: Hard Topic
Models
Overlapping Clusters /
Binary Feature Models
So far, have considered
single assignment models
What if each point can be
assigned to multiple
clusters?
Overlapping Clusters
Single Assignment
Multi Assignment
Non-Bayesian
Bayesian
Bayesian
Nonparametric
Can perform analogous small-variance asymptotics using
the above multi-assignment distributions
Indian Buffet Process
• First customer samples
• Subsequent customer i samples
dishes
existing dishes with probability equal to
fraction of previous customers sampling
that dish
• Also samples
new dishes
Indian Buffet Process SmallVariance Asymptotics
Number of points
possessing feature c
Number of new dishes
sampled by customer i
Indian Buffet Process SmallVariance Asymptotics
(Reduces to DPmeans obj. for
single assignments)
Spectral Clustering
Connections
• Standard spectral relaxation
• Write as
• Relax Y to be an arbitrary
orthogonal matrix
• Take matrix of top k eigenvectors of
Spectral Relaxation for DPmeans
• How do we extend this?
• Write as
• Relax Y to be an arbitrary
orthogonal matrix
• Take matrix of eigenvectors of K
whose eigenvalues are greater than
Graph Clustering
• Can further take advantage of
connections between k-means and
graph clustering
• Generalize hard DP objective to
weighted and kernel forms
• Special cases lead to graph clustering
objectives that penalize the number of
clusters in the graph
Conclusions and Open
Questions
• Focus was on obtaining nonprobabilistic models from probabilistic
ones
• Small-variance asymptotics for
Bayesian nonparametrics yields models
which regularize / penalize
• Number of clusters
• Number of features
• Number of topics
Conclusions and Open
Questions
• Spectral or semidefinite relaxation for
the hard HDP?
• Better algorithms?
• Local Search
• Multilevel Methods
• Split/Merge
• Incorporate ideas from other
inference schemes
• Applications?
Thanks!
2) Decentralized models
• Distributed Hard Topic Models
• Jegelka presentation
Decentralized k-means algorithms
• Data distributed
• Clusters shared globally
– Cluster assignment
– Update centers
Communication
Where store centers?
Decentralized k-means
• Local copies, gossip [Datta et al,…]
– Local assignment & update
– Sharing & averaging with neighbor(s)
Decentralized k-means
• Local copies, gossip [Datta et al,…]
• Single copy of each mean
– Summaries & pruning for restricted comparisons
[Papatreou et al.]
New questions
• Decentralized cluster creation?
– Sequential vs. parallel
• Naturally hierarchical
• Exploit structure to reduce communication
– Partial sharing, e.g. locality
– Integration into model?
• Generalization to IBP, PYP, …
Decentralized clustering
• Observe trajectories
in a scene
• Cluster locations by
local traffic behavior:
HDP
 anomalies, traffic
prediction, …
• Clusters are not
omnipresent:
partial sharing
3) Trajectory Data
• Scalable Topic Models for Trajectories
• Apply current results on scalable topic modeling to
problems in vision
– find structure in motion trajectories
– identify anomalies
– Considering Bluegrass data and possibly ARGUS track
data via LLNL collaboration
Trajectory Video
Experimental Results
Track Fragment
Road Network
Vocabulary (Colored by
50
Experimental Results
Highlights
Query
Query
Road
the
trajectory
vocabulary
trajectory
networkwithassigned to the q
vocabulary below
51
Querying Results
Note this track “word”
Northbound Trajectory stops at intersection, tu
52
Querying Results
Eastbound Trajectory
53
Querying Results
Eastbound Trajectory, Stops at Traffic L
54
Querying Results
Westbound Trajectory, Stops at Traffic L
55
Right-Hand Turn
Trajectories on Map
K-Nearest Neighbors
QUERY
1 234 5
1
0
2
0
Trajectory segments are 10 seconds in length, sampled every 2.5 seconds (75% overlap).
U-Turn
Trajectories on Map
K-Nearest Neighbors
1 2 34 5
1
0
QUERY
2
0
Trajectory segments are 10 seconds in length, sampled every 2.5 seconds (75% overlap).
Coda: Timely Detection
[Karayev and Darrell, in review]

similar documents