### Slide 1

```Uniform Sampling for
Matrix Approximation
Michael Cohen, Yin Tat Lee, Cameron Musco,
Christopher Musco, Richard Peng, Aaron Sidford
M.I.T.
OUTLINE
• Reducing Row Count
• Row Sampling and Leverage Scores
• Proof of Structural Theorem
LARGE MATRICES
Key routine in applications: size reductions via sampling
Fundamental subproblem: reduce #rows
• n-by-d matrix A, nnz non-zeros
• nnz > n >> d
A
A’
• Goal: approximate with
A’ with fewer rows
• Common requirement:
║Ax║2≈║A’x║2 ∀ x
Denote as A ≈ A’
EXISTENCE OF APPROXIMATIONS
A
A’=SA
•
•
•
•
║Ax║22 = xTATAx
ATA: d-by-d matrix
A’  QR factorization of ATA
A’ = SA, S: d-by-n
Runtime cost:
• Computing ATA: O(nnz × d)
• Finding A’ from ATA: O(dω)
When n >> d, nnz > poly(d), cost
dominated by computing ATA
INPUT SPARSITY TIME
Find A’ ≈ A in O(nnz) time
A
A’=SA
Oblivious subspace embeddings:
• Drineas, Magdon-Ismail, Mahoney,
Woodruff `12: S = fast JL matrix
• Clarkson, Woodruff `12: S with one
non-zero per column
• Nelson-Nguyen: near-optimal sparsity
OUR RESULT
rowSample(A)
A’  random half of rows of A
A”  rowSample(A’)
τ  approxProbability(A, A”)
return sample(A, τ)
A’ ≈ A consisting of O(dlogd) rescaled
rows of A in O(nnz + dω+θ) time
Θ: any constant > 0
• Based on uniform + importance sampling
• [Cohen-P `14] extends to p-norms
• [Lee-P-Spielman `14] ‘self constructing’ solvers
for linear systems in graph Laplacians
SUMMARY
• Goal: A’ with fewer rows s.t. A ≈ A’
• Existence of A’ evident via QR factorization
OUTLINE
• Reducing Row Count
• Row Sampling and Leverage Scores
• Proof of Structural Theorem
L2 ROW SAMPLING
A’ is a subset of rescaled rows of A
A
A’=SA
Matrix S:
• One non-zero per row
based on A
Works for (w.h.p.)
Oblivious
Any A
Input A
Minimum nnz(S)
n
O(d) ([BSS `09])
Rows of A’
Sums of rows of A
Rows of A
CONNECTIONS
•
•
•
Approximate algebraic algorithms
Preconditioner in scientific computing
Graph sparsification, expanders
A: edge-vertex
incidence matrix
x: labels on vertices
• Row for edge uv: |aix|2 = (xu – xv)2
• x being 0/1 indicator vector: size of cut
IMPORTANCE SAMPLING
• Keep a row, ai, with probability pi
• If picked, rescale (by pi-1/2) to keep expectation
Go from n to n’ rows:
Uniform sampling: pi = n’/n
Issue: only one non-zero row
norm sampling:
pi =n’ ║ai║22 / ║A║F2
Issue: column with one entry
MATRIX-CHERNOFF BOUNDS
τ: L2 statistical leverage scores
τi = aiT(ATA)+ai
ai row i of A
M+: pseudo-inverse of M
Rudelson, Vershynin `07, Tropp `12:
Sampling with pi ≥ τiO( logd) gives A ≈ A’ w.h.p.
[Foster `49] Σi τi = rank ≤ d
 O(dlogd) rows
HOW TO COMPUTE LEVERAGE SCORES?
τi = aiT(ATA)+ai
Given A’ ≈ A with O(dlogd) rows ,
can estimate leverage scores of
A in O(nnz(A) + dω+θ) time
• Finding A’ ≈ A need leverage scores
• Efficient leverage scoring
computation needs A’ ≈ A
• Chicken-and-egg problem
One solution: use projections,
then refine into row sample
SUMMARY
• Goal: A’ with fewer rows s.t. A ≈ A’
• Existence of good A’ evident via QR factorization
• Leverage scores (w.r.t. a subset)  good row
samples
• Total leverage score ≤ d
• Chicken and egg problem: need A’ ≈ A to find A’ ≈ A
OUTLINE
• Reducing Row Count
• Row Sampling and Leverage Scores
• Proof of Structural Theorem
[Avron `11]: boost effectiveness of randomized
algorithms via preconditioning
• Find an approximation A’
• Use A’ to compute upper bounds
of statistical leverage scores in A
• Sample A using these bounds to
obtain better approximation A’’
WIDELY USED IN PRACTICE
Nystrom method on matrices:
• Pick random subset of rows/columns
• Compute on subset
• Extend result onto the full matrix
Uniform sampling does not give spectral approximations!
Why is this effective?
Post-processing:
• Theoretical: copy x over
• Practical: projection,
least-squares fitting
Uniform
sample
post-process
• How well can uniform sampling perform?
• What do we gain from post-processing
• Are there weaker notions than ≈ that can
be used in algorithmic pipelines?
ASIDE: WHAT IS LEVERAGE
SCORE?
How easily a row can be
constructed from other rows:
τi = min ║x║2 s.t. xA = ai
x
ai
A
This view implies:
• τi ≤ 1
• τi’ from a sample are
good upper bounds
UNDEFINED LEVERAGE SCORES
A’ from random sample may have smaller rank
sampl
e
post-process
τi = aiT(ATA)+ai
Fix: add ai to A’ when computing leverage score of ai
• A’ ∪ ai subset of rows of A  good upper bounds
• Same cost: O(nnz + dω+θ + time to approximate A’TA’)
Need: bound sum of these bounds
WHAT WE SHOW
Structural theorem: if we pick half the rows as A’,
the expected total estimate is ≤ 2d
Algorithmic consequence: recurse on
A’ to approximate (A’)T(A’)
rowSample(A)
A’  random half of rows of A
A”  rowSample(A’)
τ  approxProbability(A, A”)
return sample(A, τ)
Runtime: T(nnz) = T(nnz/2) + O(nnz + dω+θ)
SUMMARY
• Goal: A’ with fewer rows s.t. A ≈ A’
• Existence of A’ evident via QR factorization
• Leverage scores (w.r.t. a subset)  good row
samples
• Total leverage score ≤ d
• Chicken and egg problem: need A’ ≈ A to find A’ ≈ A
• Uniform sampling + correction widely used
• Strong guarantees for adaptive uniform sampling
OUTLINE
• Reducing Row Count
• Row Sampling and Leverage Scores
• Proof of Structural Theorem
RANDOM PROCESS
Structural theorem: if we pick half the rows as A’,
the expected total estimate is ≤ 2d
Formally:
• S: random subset of n/2 rows
• τ i’ = leverage score of ai in a[S
• Claim: ES Σi [τ i’] ≤ 2d
∪ {i}]
• Foster’s theorem: sum over rows in A’ ≤ d
• Need to bound the rest, ES Σi ∉ S [τ i’]
= n/2 × ES, i ∉ S [τ i’]
SET WITH ONE EXTRA ELEMENT
ES, i ∉ S [τ i’]: over random subset S and random i ∉ S
Equivalent to:
•
Picking S’ = S ∪ {i} at random
•
with i picked at random from S’
E|S|=n/2, i ∉ S [τ i’] = E|S’|=n/2+1,
i ∈ S’ [τ i’]
E|S’|=n/2+1,
’]
[τ
i ∈ S’ i
Foster’s theorem: total leverage score in S’ ≤ d
Average leverage score of a row of S’ ≤ d / (n/2 + 1)
ES, i ∉ S [τ i’] ≤ d / (n/2 + 1)
• Total: n/2 × d / (n/2 + 1) ≤ d
• Overall bound follows by
including rows from S
Akin to:
• Backwards analysis (e.g. [Siedel `92])
• Randomized O(m) time MST [Klein-Karger-Tarjan `93]
WHAT WE ALSO SHOW:
Coherence reducing reweighting:
For any α < 1, can reweigh d/α
rows so all leverage scores ≤α
This implies the sampling result
• Leverage score ≤ 1/2logd:
uniform 1/2 A’ gives A’ ≈ A
• Sample only deviated on the
rows that we changed
• Recomputing leverage scores
w.r.t. A’ finds these rows
SUMMARY
• Goal: A’ with fewer rows s.t. A ≈ A’
• Existence of A’ evident via QR factorization
• Leverage scores (w.r.t. a subset)  good row
samples
• Total leverage score ≤ d
• Chicken and egg problem: need A’ ≈ A to find A’ ≈ A
• Uniform sampling + correction widely used
• Strong guarantees for adaptive uniform sampling
• Backward analysis of adaptive uniform sampling
• Coherence reducing reweighting
FUTURE WORK
•
•
•
•
•
•
Application in streaming sparsifiers?
Algorithms for low rank approximations?
Does norm sampling offer more?
Backward analysis of other adaptive routines?
How close to Nystrom methods can we get?
More limited randomness?
Reference: http://arxiv.org/abs/1408.5099
```