Report

Uniform Sampling for Matrix Approximation Michael Cohen, Yin Tat Lee, Cameron Musco, Christopher Musco, Richard Peng, Aaron Sidford M.I.T. OUTLINE • Reducing Row Count • Row Sampling and Leverage Scores • Adaptive Uniform Sampling • Proof of Structural Theorem LARGE MATRICES Key routine in applications: size reductions via sampling Fundamental subproblem: reduce #rows • n-by-d matrix A, nnz non-zeros • nnz > n >> d A A’ • Goal: approximate with A’ with fewer rows • Common requirement: ║Ax║2≈║A’x║2 ∀ x Denote as A ≈ A’ EXISTENCE OF APPROXIMATIONS A A’=SA • • • • ║Ax║22 = xTATAx ATA: d-by-d matrix A’ QR factorization of ATA A’ = SA, S: d-by-n Runtime cost: • Computing ATA: O(nnz × d) • Finding A’ from ATA: O(dω) When n >> d, nnz > poly(d), cost dominated by computing ATA INPUT SPARSITY TIME Find A’ ≈ A in O(nnz) time A A’=SA Oblivious subspace embeddings: • Drineas, Magdon-Ismail, Mahoney, Woodruff `12: S = fast JL matrix • Clarkson, Woodruff `12: S with one non-zero per column • Nelson-Nguyen: near-optimal sparsity OUR RESULT rowSample(A) A’ random half of rows of A A” rowSample(A’) τ approxProbability(A, A”) return sample(A, τ) A’ ≈ A consisting of O(dlogd) rescaled rows of A in O(nnz + dω+θ) time Θ: any constant > 0 • Based on uniform + importance sampling • [Cohen-P `14] extends to p-norms • [Lee-P-Spielman `14] ‘self constructing’ solvers for linear systems in graph Laplacians SUMMARY • Goal: A’ with fewer rows s.t. A ≈ A’ • Existence of A’ evident via QR factorization OUTLINE • Reducing Row Count • Row Sampling and Leverage Scores • Adaptive Uniform Sampling • Proof of Structural Theorem L2 ROW SAMPLING A’ is a subset of rescaled rows of A A A’=SA Matrix S: • One non-zero per row • Generated adaptively based on A Works for (w.h.p.) Oblivious Any A Adaptive Input A Minimum nnz(S) n O(d) ([BSS `09]) Rows of A’ Sums of rows of A Rows of A CONNECTIONS • • • Approximate algebraic algorithms Preconditioner in scientific computing Graph sparsification, expanders A: edge-vertex incidence matrix x: labels on vertices • Row for edge uv: |aix|2 = (xu – xv)2 • x being 0/1 indicator vector: size of cut IMPORTANCE SAMPLING • Keep a row, ai, with probability pi • If picked, rescale (by pi-1/2) to keep expectation Go from n to n’ rows: Uniform sampling: pi = n’/n Issue: only one non-zero row norm sampling: pi =n’ ║ai║22 / ║A║F2 Issue: column with one entry MATRIX-CHERNOFF BOUNDS τ: L2 statistical leverage scores τi = aiT(ATA)+ai ai row i of A M+: pseudo-inverse of M Rudelson, Vershynin `07, Tropp `12: Sampling with pi ≥ τiO( logd) gives A ≈ A’ w.h.p. [Foster `49] Σi τi = rank ≤ d O(dlogd) rows HOW TO COMPUTE LEVERAGE SCORES? τi = aiT(ATA)+ai Given A’ ≈ A with O(dlogd) rows , can estimate leverage scores of A in O(nnz(A) + dω+θ) time • Finding A’ ≈ A need leverage scores • Efficient leverage scoring computation needs A’ ≈ A • Chicken-and-egg problem One solution: use projections, then refine into row sample SUMMARY • Goal: A’ with fewer rows s.t. A ≈ A’ • Existence of good A’ evident via QR factorization • Leverage scores (w.r.t. a subset) good row samples • Total leverage score ≤ d • Chicken and egg problem: need A’ ≈ A to find A’ ≈ A OUTLINE • Reducing Row Count • Row Sampling and Leverage Scores • Adaptive Uniform Sampling • Proof of Structural Theorem ADAPTIVE ROW SAMPLING [Avron `11]: boost effectiveness of randomized algorithms via preconditioning • Find an approximation A’ • Use A’ to compute upper bounds of statistical leverage scores in A • Sample A using these bounds to obtain better approximation A’’ WIDELY USED IN PRACTICE Nystrom method on matrices: • Pick random subset of rows/columns • Compute on subset • Extend result onto the full matrix Uniform sampling does not give spectral approximations! Why is this effective? Post-processing: • Theoretical: copy x over • Practical: projection, least-squares fitting QUESTIONS THAT WE ADDRESS Uniform sample post-process • How well can uniform sampling perform? • What do we gain from post-processing • Are there weaker notions than ≈ that can be used in algorithmic pipelines? ASIDE: WHAT IS LEVERAGE SCORE? How easily a row can be constructed from other rows: τi = min ║x║2 s.t. xA = ai x ai A This view implies: • τi ≤ 1 • τi’ from a sample are good upper bounds UNDEFINED LEVERAGE SCORES A’ from random sample may have smaller rank sampl e post-process τi = aiT(ATA)+ai Fix: add ai to A’ when computing leverage score of ai • A’ ∪ ai subset of rows of A good upper bounds • Same cost: O(nnz + dω+θ + time to approximate A’TA’) Need: bound sum of these bounds WHAT WE SHOW Structural theorem: if we pick half the rows as A’, the expected total estimate is ≤ 2d Algorithmic consequence: recurse on A’ to approximate (A’)T(A’) rowSample(A) A’ random half of rows of A A” rowSample(A’) τ approxProbability(A, A”) return sample(A, τ) Runtime: T(nnz) = T(nnz/2) + O(nnz + dω+θ) SUMMARY • Goal: A’ with fewer rows s.t. A ≈ A’ • Existence of A’ evident via QR factorization • Leverage scores (w.r.t. a subset) good row samples • Total leverage score ≤ d • Chicken and egg problem: need A’ ≈ A to find A’ ≈ A • Uniform sampling + correction widely used • Strong guarantees for adaptive uniform sampling OUTLINE • Reducing Row Count • Row Sampling and Leverage Scores • Adaptive Uniform Sampling • Proof of Structural Theorem RANDOM PROCESS Structural theorem: if we pick half the rows as A’, the expected total estimate is ≤ 2d Formally: • S: random subset of n/2 rows • τ i’ = leverage score of ai in a[S • Claim: ES Σi [τ i’] ≤ 2d ∪ {i}] • Foster’s theorem: sum over rows in A’ ≤ d • Need to bound the rest, ES Σi ∉ S [τ i’] = n/2 × ES, i ∉ S [τ i’] SET WITH ONE EXTRA ELEMENT ES, i ∉ S [τ i’]: over random subset S and random i ∉ S Equivalent to: • Picking S’ = S ∪ {i} at random • with i picked at random from S’ E|S|=n/2, i ∉ S [τ i’] = E|S’|=n/2+1, i ∈ S’ [τ i’] E|S’|=n/2+1, ’] [τ i ∈ S’ i Foster’s theorem: total leverage score in S’ ≤ d Average leverage score of a row of S’ ≤ d / (n/2 + 1) ES, i ∉ S [τ i’] ≤ d / (n/2 + 1) • Total: n/2 × d / (n/2 + 1) ≤ d • Overall bound follows by including rows from S Akin to: • Backwards analysis (e.g. [Siedel `92]) • Randomized O(m) time MST [Klein-Karger-Tarjan `93] WHAT WE ALSO SHOW: Coherence reducing reweighting: For any α < 1, can reweigh d/α rows so all leverage scores ≤α This implies the sampling result • Leverage score ≤ 1/2logd: uniform 1/2 A’ gives A’ ≈ A • Sample only deviated on the rows that we changed • Recomputing leverage scores w.r.t. A’ finds these rows SUMMARY • Goal: A’ with fewer rows s.t. A ≈ A’ • Existence of A’ evident via QR factorization • Leverage scores (w.r.t. a subset) good row samples • Total leverage score ≤ d • Chicken and egg problem: need A’ ≈ A to find A’ ≈ A • Uniform sampling + correction widely used • Strong guarantees for adaptive uniform sampling • Backward analysis of adaptive uniform sampling • Coherence reducing reweighting FUTURE WORK • • • • • • Application in streaming sparsifiers? Algorithms for low rank approximations? Does norm sampling offer more? Backward analysis of other adaptive routines? How close to Nystrom methods can we get? More limited randomness? Reference: http://arxiv.org/abs/1408.5099