
Efficient Large-Scale Structured Learning
Steve Branson (Caltech), Oscar Beijbom (UC San Diego), Serge Belongie (UC San Diego)
CVPR 2013, Portland, Oregon

Overview
• Structured prediction
• Learning from larger datasets
[Figures: deformable part models; object detection; cost-sensitive learning over a class taxonomy (Mammal → Primate → Orangutan, Gorilla; Mammal → Hoofed Mammal → Odd-toed, Even-toed); Tiny Images / large datasets]

Overview
• Available tools for structured learning are not as refined as tools for binary classification
• 2 sources of speed improvement:
– Faster stochastic dual optimization algorithms
– Application-specific importance sampling routine

Summary
• Usually, train time = 1-10 times test time
• Publicly available software package:
– Fast algorithms for multiclass SVMs, DPMs
– API to adapt to new applications
– Support for datasets too large to fit in memory
– Network interface for online & active learning

Summary
• Deformable part models: 50-1000X faster than
– SVMstruct
– Mining hard negatives
– SGD (PEGASOS)
• Cost-sensitive multiclass SVM:
– 10-50X faster than SVMstruct
– As fast as 1-vs-all binary SVM

Binary vs. Structured
• Binary output: y = −1 or y = +1
• Structured output: e.g., a bounding box y = (x, y, w, h)
• Binary route: reduce the structured dataset to a binary dataset and apply a binary learner (SVM, boosting, logistic regression, etc.)
• Structured route: learn directly for the structured task (object detection, pose registration, attribute prediction, etc.)
• Pros: the binary classifier is application independent
• Cons: what is lost in terms of:
– Accuracy at convergence?
– Computational efficiency?

Binary vs. Structured
• The binary loss Δ01 only approximates the structured prediction loss Δ(f(x), y)
• Both are trained through a convex upper bound, e.g., hinge or exponential loss
• ℓ(w; x, y): a convex upper bound on the structured prediction loss Δ(f(x), y)
• The convex upper bound is the source of computational speed

Binary vs. Structured
• Goal: application-specific optimization algorithms that
– Converge to lower test error than binary solutions
– Achieve lower test error for all amounts of train time

Structured SVM
• SVMs w/ structured output [Tsochantaridis et al., ICML'04]
• Max-margin MRF [Taskar et al., NIPS'03]

Binary SVM Solvers
• Non-linear/kernel (LIBSVM, SVMlight) ≫ cutting plane (SVMperf) > SGD (PEGASOS) ≥ sequential or stochastic dual (LIBLINEAR)
• Kernel → cutting plane: quadratic to linear in trainset size
• SGD → stochastic dual: linear to independent in trainset size
• Stochastic dual methods:
– Faster on multiple passes
– Can detect convergence
– Less sensitive to regularization/learning rate

Structured SVM Solvers
• Applied to SSVMs: cutting plane (SVMstruct) > SGD [Ratliff et al., AIStats'07] ≥ sequential or stochastic dual [Shalev-Shwartz et al., JMLR'13]

Our Approach
• Use faster stochastic dual algorithms
• Incorporate an application-specific importance sampling routine:
– Reduces train times when prediction time T is large
– Incorporates tricks people use for binary methods
• For t = 1… do
1. Choose a random training example (X_i, Y_i)
2. {ȳ_1, ȳ_2, …} ← ImportanceSample(X_i, Y_i; w_{t−1})
3. Approximately maximize the dual SSVM objective w.r.t. the samples for example i
end
• (Provably fast convergence for a simple approximate solver)

Recent Papers w/ Similar Ideas
• Augmenting cutting-plane SSVMs w/ m-best solutions: A. Guzman-Rivera, P. Kohli, D. Batra. "DivMCuts…" AISTATS'13.
• Applying stochastic dual methods to SSVMs: S. Lacoste-Julien et al. "Block-Coordinate Frank-Wolfe…" JMLR'13.

Applying to New Problems
1. Define the loss function Δ(y, ȳ)
2. Implement the feature extraction routine Φ(x, y)
3. Implement the importance sampling routine, which should
a) Be fast
b) Favor samples w/
– High loss-augmented score ⟨w, Φ(x, ȳ)⟩ + Δ(y, ȳ)
– Uncorrelated features: small ⟨Φ(x, ȳ_j), Φ(x, ȳ_k)⟩

Example: Object Detection
1. Loss function: Δ(y, ȳ) = 1 − area(y ∩ ȳ) / area(y ∪ ȳ)
2. Features: Φ(x, y)
3. Importance sampling routine:
– Add sliding-window score & loss into a dense score map
– Greedy NMS

Example: Deformable Part Models
1. Loss function: Δ(y, ȳ) = sum of part losses
2. Features: Φ(x, y)
3. Importance sampling routine:
– Dynamic programming
– Modified NMS to return a diverse set of poses

Cost-Sensitive Multiclass SVM
1. Loss function: class confusion cost matrix Δ(y, ȳ) over classes, e.g., cat, dog, ant, fly, car, bus
2. Features: e.g., bag-of-words
3. Importance sampling routine:
– Return all classes
– Exact solution using 1 dot product per class

Results: CUB-200-2011
• Pose mixture model, 312 part/pose detectors
• Occlusion/visibility model
• Tree-structured DPM w/ exact inference
[Plots: test error vs. train time, for 5794 and for 400 training examples]
• ~100X faster than mining hard negatives and SVMstruct
• 10-50X faster than stochastic sub-gradient methods
• Close to convergence after 1 pass through the training set

Results: ImageNet
• Comparison to other fast linear SVM solvers, and to other methods for cost-sensitive SVMs
• Faster than LIBLINEAR, PEGASOS
• 50X faster than SVMstruct

Conclusion
• Orders of magnitude faster than SVMstruct
• Publicly available software package:
– Fast algorithms for multiclass SVMs, DPMs
– API to adapt to new applications
– Support for datasets too large to fit in memory
– Network interface for online & active learning

Thanks!
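The "Our Approach" loop (random example → ImportanceSample → approximate dual maximization) is closely related to the block-coordinate Frank-Wolfe method cited on the "Recent Papers" slide. Below is a minimal runnable sketch for the multiclass special case, where the importance sample is simply the set of all classes; the function names and the toy joint feature map are my own illustration, and the line-search step follows the published BCFW form, not the authors' released solver.

```python
import numpy as np

def phi(x, y, n_classes):
    """Toy joint feature map: x placed in the block belonging to class y."""
    out = np.zeros(n_classes * x.size)
    out[y * x.size:(y + 1) * x.size] = x
    return out

def delta(y, ybar):
    """0/1 loss; a class-confusion cost matrix could be substituted here."""
    return 0.0 if y == ybar else 1.0

def train_ssvm_stochastic_dual(X, Y, n_classes, lam=0.01, n_epochs=20, seed=0):
    """Stochastic dual (BCFW-style) SSVM training: pick a random example,
    importance-sample candidate outputs (here: all classes, exact), then
    take an optimal line-search step on that example's dual block."""
    rng = np.random.default_rng(seed)
    n = len(Y)
    d = X.shape[1] * n_classes
    w = np.zeros(d)              # global weight vector (sum of blocks)
    w_i = np.zeros((n, d))       # per-example primal-encoded dual blocks
    l_i = np.zeros(n)            # per-example loss terms
    for _ in range(n_epochs):
        for i in rng.permutation(n):
            # loss-augmented argmax over the importance sample (all classes)
            scores = [delta(Y[i], c) + w @ phi(X[i], c, n_classes)
                      for c in range(n_classes)]
            ybar = int(np.argmax(scores))
            psi = phi(X[i], Y[i], n_classes) - phi(X[i], ybar, n_classes)
            w_s = psi / (lam * n)
            l_s = delta(Y[i], ybar) / n
            diff = w_i[i] - w_s
            denom = lam * (diff @ diff)
            # closed-form line search on this block, clipped to [0, 1]
            gamma = 0.0 if denom == 0 else float(
                np.clip((lam * (diff @ w) - l_i[i] + l_s) / denom, 0.0, 1.0))
            w -= gamma * diff    # keep w consistent with the block update
            w_i[i] = (1 - gamma) * w_i[i] + gamma * w_s
            l_i[i] = (1 - gamma) * l_i[i] + gamma * l_s
    return w

def predict(w, x, n_classes):
    return int(np.argmax([w @ phi(x, c, n_classes) for c in range(n_classes)]))
```

Because each step is a closed-form maximization of the dual on one block, there is no learning rate to tune, matching the "less sensitive to regularization/learning rate" point on the solver-comparison slide.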
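The object-detection example pairs an overlap-based loss with greedy non-maximum suppression so that the sampled windows have high loss-augmented score but little mutual overlap (and hence roughly uncorrelated features). A small sketch under those assumptions; `importance_sample_nms` and the corner-coordinate box representation are illustrative choices, not the package's API, and a real detector would read scores from a dense sliding-window score map.

```python
def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def detection_loss(y_true, y_pred):
    """Overlap loss from the slide: 1 - area(y ∩ ȳ) / area(y ∪ ȳ)."""
    return 1.0 - iou(y_true, y_pred)

def importance_sample_nms(boxes, scores, y_true, k=5, overlap_thresh=0.5):
    """Greedy NMS over loss-augmented scores: repeatedly keep the box with
    the highest score + loss, suppressing boxes that overlap a kept box."""
    aug = [s + detection_loss(y_true, b) for s, b in zip(scores, boxes)]
    order = sorted(range(len(boxes)), key=lambda i: -aug[i])
    keep = []
    for i in order:
        if len(keep) == k:
            break
        if all(iou(boxes[i], boxes[j]) < overlap_thresh for j in keep):
            keep.append(i)
    return [boxes[i] for i in keep]
```

A window far from the ground truth can still be sampled if its loss term 1 − IoU is large, which is exactly the "favor samples with high loss" criterion from the "Applying to New Problems" slide.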
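For the cost-sensitive multiclass SVM, the "importance sample" is exact: every class is a candidate output, so the loss-augmented argmax costs one dot product per class plus a lookup in the confusion-cost matrix. A minimal sketch, with names of my own choosing:

```python
import numpy as np

def loss_augmented_argmax(W, x, y_true, confusion_cost):
    """Exact importance sampling for the cost-sensitive multiclass SVM:
    one dot product per class, plus the class-confusion loss row for
    the true class, then an argmax over the augmented scores."""
    scores = W @ x                         # one dot product per class
    aug = scores + confusion_cost[y_true]  # add Δ(y_true, c) for each class c
    return int(np.argmax(aug))
```

With a zero-cost diagonal, a confidently correct prediction yields the true class itself as the argmax, in which case the dual update is a no-op.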