
Learning Scalable Discriminative Dictionaries with Sample Relatedness a.k.a. "Infinite Attributes"
Jiashi Feng, Stefanie Jegelka, Shuicheng Yan, Trevor Darrell

Attribute Learning
• Example attributes: striped, water, white, furry, bright, wheels
• generalizable vs. discriminative
• which attributes to use?
(Lampert, Nickisch & Harmeling, 2009; Farhadi & Forsyth, 2009; Parikh & Grauman, 2011 …)

Attribute Generative Model (Cartoon)
[Figure: objects (face, cup, car, …) are composed of attributes (eye, nose, mouth, …), which are in turn composed of low-level features (edges, …).]

Goals I
• Flexibility: automatically determine the attributes
  – as expressive as needed, as compact as possible → non-parametric Bayesian
[Figure: animals share attributes such as striped, water, white, furry — do humans need the same attribute set?]

Goals II
• Efficiently learnable: few positive training samples
  – reduce sample complexity
• Knowledge transfer via attributes from related samples (Samoyed dog, Pug dog, Corgi dog)

Goals III
• Discriminative: object classification task → max margin

Outline
• Non-parametric Bayesian model for flexible attribute learning
• Sample relatedness for knowledge transfer
• Discriminative generative model

Preliminaries: Non-parametric Bayesian
• Bayes' rule applied in machine learning:
  p(θ | D) = p(D | θ) p(θ) / p(D)
  i.e., the posterior of θ given data D is proportional to the likelihood of D given θ times the prior probability of θ.
• Model comparison for model selection: p(m | D) ∝ p(D | m) p(m), with marginal likelihood p(D | m) = ∫ p(D | θ, m) p(θ | m) dθ.
• Prediction: p(x | D, m) = ∫ p(x | θ, D, m) p(θ | D, m) dθ.

Non-parametric Bayesian Models
• Inflexible models yield unreasonable inferences.
• Non-parametric models can automatically infer an adequate model size/complexity from the data, without needing to explicitly do Bayesian model comparison.
• Many can be derived by starting with a finite parametric model and taking the limit as the number of parameters goes to infinity.

Finite Mixture Model
• Set of observations x_1, …, x_N
• Constant number of clusters K; the cluster assignment for x_i is z_i
• The probability of each sample: p(x_i) = Σ_{k=1}^{K} π_k p(x_i | θ_k)
• The likelihood of the samples: p(X) = ∏_{i=1}^{N} p(x_i)

Infinite Mixture Model
• Take K → ∞ clusters in the likelihood.
• Since we always have finitely many samples in reality, only finitely many clusters are actually used; so we define two sets of clusters:
  – K+: the number of classes for which n_k > 0 (occupied)
  – K0: the number of classes for which n_k = 0 (empty)
• Assume a reordering such that the K+ occupied classes come first.

Finite Feature Model
• Generating Z: an N × K binary matrix (N customers by K features)
  – For each column k, draw π_k from a beta distribution: π_k ~ Beta(α/K, 1)
  – For each customer i, flip a coin with bias π_k: z_ik ~ Bernoulli(π_k)
• Distribution of Z: integrating out π leaves
  P(Z) = ∏_{k=1}^{K} (α/K) Γ(m_k + α/K) Γ(N − m_k + 1) / Γ(N + 1 + α/K), where m_k = Σ_i z_ik.
• Z is sparse: even as K → ∞, the matrix is expected to have a finite number of non-zero elements.

From Finite to Infinite Binary Matrices
• A technical difficulty: the probability of any particular matrix goes to zero as K → ∞.
• However, if we consider equivalence classes [Z] of matrices in left-ordered form, obtained by reordering the columns, the limit is well defined:
  P([Z]) = α^{K+} / (∏_{h>0} K_h!) · exp(−α H_N) · ∏_{k=1}^{K+} (N − m_k)! (m_k − 1)! / N!
  – K+ is the number of features assigned (columns with m_k > 0)
  – H_N = Σ_{j=1}^{N} 1/j is the N-th harmonic number
  – K_h is the number of columns with binary history h
  – This distribution is exchangeable: it is independent of the ordering of the customers.
[Figure: (a) a binary matrix is transformed into the matrix on the right by the function lof(); (b) a left-ordered binary matrix generated by the Indian Buffet Process.]

Indian Buffet Process
"Many Indian restaurants offer lunchtime buffets with an apparently infinite number of dishes."
• The first customer starts at the left of the buffet and takes a serving from each dish, stopping after a Poisson(α) number of dishes as her plate becomes overburdened.
• The i-th customer moves along the buffet, sampling previously tried dishes in proportion to their popularity, i.e., dish k with probability m_k / i, and then trying a Poisson(α/i) number of new dishes.
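The buffet metaphor translates directly into a generative sampler. Below is a minimal Python/numpy sketch of the process just described; the function name sample_ibp and the parameter names are our own, and the snippet illustrates the prior rather than reproducing any implementation from the talk.

```python
import numpy as np

def sample_ibp(n_customers, alpha, rng=None):
    """Draw a binary feature matrix Z from the Indian Buffet Process.

    Row i is customer i's plate: Z[i, k] = 1 iff customer i took dish k.
    """
    rng = np.random.default_rng() if rng is None else rng
    counts = []  # counts[k] = m_k, the number of customers who tried dish k
    rows = []
    for i in range(1, n_customers + 1):
        row = []
        # Previously tried dishes: take dish k with probability m_k / i.
        for k in range(len(counts)):
            take = rng.random() < counts[k] / i
            row.append(1 if take else 0)
            if take:
                counts[k] += 1
        # New dishes: a Poisson(alpha / i) number of them
        # (for the first customer this is Poisson(alpha)).
        n_new = rng.poisson(alpha / i)
        row.extend([1] * n_new)
        counts.extend([1] * n_new)
        rows.append(row)
    # Pad to a rectangle: earlier customers never saw later dishes.
    Z = np.zeros((n_customers, len(counts)), dtype=int)
    for i, row in enumerate(rows):
        Z[i, :len(row)] = row
    return Z

# Example: 10 images, alpha = 2; the expected number of dishes is alpha * H_10.
Z = sample_ibp(10, alpha=2.0, rng=np.random.default_rng(0))
print(Z.shape, Z.sum(axis=0))  # matrix size and per-dish popularity m_k
```

The columns of Z play the role of candidate attributes: the counts m_k give the "rich get richer" reuse of popular dishes, while the Poisson(α/i) term keeps adding fresh dishes at a decaying rate, so the number of dishes grows as α H_N rather than being fixed in advance.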
Non-parametric Learning
• Infinite attributes via an Indian Buffet Process prior:
  – prob(image n samples attribute k) = m_k / n
  – sample a Poisson(α/n) number of new attributes (striped, white, furry, bright, wheels, …)
• Likelihood: linear-Gaussian — each image is approximately a binary combination of dictionary attributes, x_n ~ N(A z_n, σ² I).
(Griffiths & Ghahramani, 2006)

Asymptotic Model
• Same prior: prob(image n samples attribute k) = m_k / n, plus sampling new attributes.
• Asymptotics: let the noise variance σ² → 0, so data ≈ binary assignments × dictionary and MAP inference reduces to a squared-loss objective with a penalty on the dictionary size.
• The dictionary size is determined automatically.
(Broderick, Kulis & Jordan, ICML 2013)

Asymptotics
• Mixture of Gaussians: the Bayesian non-parametric DP mixture is flexible and principled; letting the covariance go to zero yields k-means — simple, efficient, "practical".
• Principled discrete criteria from BNP:
  – Dirichlet Process → k-means + penalty (Kulis & Jordan, ICML 2012)
  – Beta Process → squared loss + penalty (Broderick, Kulis & Jordan, ICML 2013)
  – Dependent Dirichlet Process → ?? (Campbell, Liu, Kulis, How & Carin, NIPS 2013)

Sample Relatedness
• Which samples are related (polar bear, clown fish, motorbike)?
• Measure relatedness by path length in WordNet.
(Christiane Fellbaum, WordNet, 1998)

Full Model
• Attributes are learned from input features jointly with max-margin classifiers, making the representation discriminative.
• Sample relatedness enters the attribute model: e.g., for Pug Dog, the Pug Dog images are positive samples, Samoyed Dog images are related samples, and Cat images are negative samples.

Joint Learning of Dictionary & Classifiers
• BCD: alternately update classifiers & dictionary; the dictionary updates are sketched in code below.

while not converged do
  1: update z_ik ∈ {0, 1} greedily
  2: A ← X Z^T (Z Z^T)^{−1}
  3: sample a new attribute a_{K+1}: p(a_{K+1} = x_i) ∝ ‖x_i − A z_i‖²
  4: A ← [A, a_{K+1}]
end
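A minimal numpy sketch of these dictionary updates, under assumptions of our own: X is a d × N data matrix, Z a K × N binary code matrix, and A the d × K dictionary. The classifier updates and the sample-relatedness weighting of the full model are omitted, and a new attribute is appended unconditionally rather than accepted against a size-penalized objective.

```python
import numpy as np

def bcd_dictionary(X, n_iters=10, rng=None):
    """Alternating (BCD) dictionary updates from the algorithm above:
    greedy binary codes, a closed-form least-squares dictionary, and a
    residual-weighted proposal for each new attribute."""
    rng = np.random.default_rng() if rng is None else rng
    d, N = X.shape
    A = X[:, [rng.integers(N)]].copy()  # start with one attribute: a random sample
    Z = np.zeros((1, N))

    for _ in range(n_iters):
        # 1: update z_ik in {0, 1} greedily -- keep whichever value gives
        #    the smaller reconstruction residual for sample i.
        for i in range(N):
            for k in range(A.shape[1]):
                z = Z[:, i]  # view into Z, modified in place
                z[k] = 0.0
                e0 = np.linalg.norm(X[:, i] - A @ z)
                z[k] = 1.0
                e1 = np.linalg.norm(X[:, i] - A @ z)
                z[k] = 1.0 if e1 < e0 else 0.0
        # 2: A <- X Z^T (Z Z^T)^(-1); pinv guards against unused attributes.
        A = X @ Z.T @ np.linalg.pinv(Z @ Z.T)
        # 3: sample a new attribute a_{K+1} with p(a_{K+1} = x_i) ∝ ‖x_i − A z_i‖².
        resid = np.sum((X - A @ Z) ** 2, axis=0) + 1e-12
        i_new = rng.choice(N, p=resid / resid.sum())
        # 4: A <- [A, a_{K+1}], with a matching all-zero row added to Z.
        #    (The full method would keep the new attribute only if it improves a
        #    size-penalized objective; here it is appended unconditionally.)
        A = np.column_stack([A, X[:, i_new]])
        Z = np.vstack([Z, np.zeros((1, N))])
    return A, Z

# Example on synthetic data: d = 50 features, N = 200 samples.
X = np.random.default_rng(1).normal(size=(50, 200))
A, Z = bcd_dictionary(X, n_iters=5, rng=np.random.default_rng(2))
print(A.shape, Z.shape)
```

Step 2 is the exact least-squares solution for A with Z fixed, and step 3 preferentially spawns new attributes from poorly reconstructed samples, which is how the dictionary grows only where the current attributes fall short.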
Does It Work?
• Classification accuracy on ImageNet: the method is sample-efficient — higher accuracy with fewer training samples.
• Generalization: a better representation of new classes (AwA data).

Why Does It Work?
• More related information acts like more data: using related samples increases sample-efficiency.
• Non-parametric: the dictionary adapts to the complexity of the data, making the representation efficient.

Conclusions
• Flexible attribute learning method
  – generalizes to new categories
  – adapts to the dataset complexity
• Efficiently learnable
  – sample-efficient
  – reduces the user annotation effort
• Performs well
  – recognizes both existing and new categories well

Thanks! Q&A