### Presentation

```
From Differential Privacy to Machine Learning, and Back
Yahoo Labs, Sunnyvale
Thesis: Differential privacy ⇒ generalizability
Stable learning ⇒ differential privacy
Part II of this Talk
1. Recap: Differential privacy and convex risk minimization
2. Private gradient descent and risk minimization in low-dimensions
3. Private Frank-Wolfe and risk minimization in high-dimensions
4. Private feature selection in high-dimensions and the LASSO
Recap: Differential privacy
Differential Privacy [DMNS06, DKMMN06]
• Adversary learns essentially the same thing irrespective of your
presence or absence in the data set
[Figure: a data set D fed to a randomized algorithm A (with random coins) yields output A(D); a neighboring data set D′ yields A(D′)]
• D and D′ are called neighboring data sets
• Require: Neighboring data sets induce close distribution on outputs
Differential Privacy [DMNS06, DKMMN06]
Definition:
A randomized algorithm A is (ε, δ)-differentially private if
• for all data sets D and D′ that differ in one element
• for all sets of answers S
Pr[A(D) ∈ S] ≤ e^ε · Pr[A(D′) ∈ S] + δ
Semantics of Differential Privacy
• Differential privacy is a condition on the algorithm
• Guarantee is meaningful in the presence of any auxiliary information
• Typically, think of privacy parameters: ε ≈ 0.1 and δ = 1/n^{log n},
where n = # of data samples
• Composition: ε's and δ's add up over multiple executions
A few tools for designing differentially private algorithms
Laplace Mechanism [DMNS06]
Data set D = {d₁, ⋯, d_n}; let f: D* → ℝ^k be a function on D
Global sensitivity: GS(f, 1) = max over neighboring D, D′ (differing in one element) of ||f(D) − f(D′)||₁
1. b: random vector with coordinates sampled i.i.d. from Lap(GS(f, 1)/ε)
2. Output f(D) + b
Theorem (Privacy): The algorithm is ε-differentially private
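The mechanism above can be sketched in a few lines of NumPy. The helper name and the counting-query example are illustrative (a counting query has ℓ₁ global sensitivity 1, since one record changes the count by at most 1); this is a minimal sketch, not the talk's implementation.

```python
import numpy as np

def laplace_mechanism(f_value, gs_l1, eps, rng=None):
    """Release f(D) with Laplace noise calibrated to the L1 global sensitivity."""
    rng = np.random.default_rng() if rng is None else rng
    f_value = np.atleast_1d(np.asarray(f_value, dtype=float))
    noise = rng.laplace(loc=0.0, scale=gs_l1 / eps, size=f_value.shape)
    return f_value + noise

# Example: a counting query ("how many records exceed 0.5?") has GS(f, 1) = 1.
data = np.array([0.2, 0.7, 0.9, 0.4, 0.8])
count = np.sum(data > 0.5)                      # f(D) = 3
noisy_count = laplace_mechanism(count, gs_l1=1.0, eps=0.5)
```

With a smaller ε the noise scale GS(f, 1)/ε grows, which is exactly the privacy/accuracy trade-off the rest of the talk quantifies.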
Gaussian Mechanism [DKMMN06]
Data set D = {d₁, ⋯, d_n}; let f: D* → ℝ^k be a function on D
Global sensitivity: GS(f, 2) = max over neighboring D, D′ (differing in one element) of ||f(D) − f(D′)||₂
1. b: random vector sampled from N(0, σ² · I_k) with σ ≈ GS(f, 2) · √(log(1/δ))/ε
2. Output f(D) + b
Theorem (Privacy): The algorithm is (ε, δ)-differentially private
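A matching sketch for the Gaussian mechanism. The constant in σ below uses the common √(2 ln(1.25/δ)) calibration, which is an assumption beyond the slide's ≈ statement; the mean-release example and its sensitivity bound 2/n are likewise illustrative.

```python
import numpy as np

def gaussian_mechanism(f_value, gs_l2, eps, delta, rng=None):
    """Release f(D) with Gaussian noise calibrated to the L2 global sensitivity."""
    rng = np.random.default_rng() if rng is None else rng
    f_value = np.atleast_1d(np.asarray(f_value, dtype=float))
    # standard calibration: sigma >= GS(f,2) * sqrt(2 ln(1.25/delta)) / eps
    sigma = gs_l2 * np.sqrt(2.0 * np.log(1.25 / delta)) / eps
    return f_value + rng.normal(0.0, sigma, size=f_value.shape)

# Example: the mean of rows with ||x_i||_2 <= 1 has L2 sensitivity 2/n
# (replacing one row moves the mean by at most 2/n).
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 5))
X /= np.maximum(1.0, np.linalg.norm(X, axis=1, keepdims=True))  # clip rows
noisy_mean = gaussian_mechanism(X.mean(axis=0), gs_l2=2.0 / len(X), eps=1.0, delta=1e-5)
```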
Report-Noisy-Max (a.k.a. Exponential Mechanism) [MT07, BLST10]
Set of candidate outputs Θ
Score function q: Θ × D* → ℝ (where D* is the domain of data sets)
Objective: Output θ ∈ Θ maximizing q(θ, D)
Global sensitivity: GS(q) = max over θ ∈ Θ and all neighbors D, D′ of |q(θ, D) − q(θ, D′)|
Report-Noisy-Max (a.k.a. Exponential Mechanism) [MT07, BLST10]
Objective: Output θ ∈ Θ maximizing q(θ, D)
Global sensitivity: GS(q) = max over θ ∈ Θ and all neighbors D, D′ of |q(θ, D) − q(θ, D′)|
1. ∀θ ∈ Θ: v_θ ← q(θ, D) + Lap(GS(q)/ε)
2. Output the θ with the highest value of v_θ
Theorem (Privacy): The algorithm is 2ε-differentially private
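The two steps above are short enough to write out directly; a minimal sketch (the function name and the "most frequent item" example are illustrative, with counts having sensitivity 1):

```python
import numpy as np

def report_noisy_max(scores, gs_q, eps, rng=None):
    """Return the index of the candidate whose Laplace-noised score is largest.

    scores[j] = q(theta_j, D); gs_q is the global sensitivity of q.
    """
    rng = np.random.default_rng() if rng is None else rng
    noisy = np.asarray(scores, dtype=float) + rng.laplace(0.0, gs_q / eps, size=len(scores))
    return int(np.argmax(noisy))

# Example: privately pick the most frequent item; histogram counts have GS(q) = 1.
counts = [120, 85, 119, 30]
winner = report_noisy_max(counts, gs_q=1.0, eps=2.0)
```

Note only the arg max is released, not the noisy scores themselves; that is what makes the 2ε accounting work.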
Composition Theorems [DMNS06, DL09, DRV10]
(ε, δ)-diff. private algorithms A₁, ⋯, A_k, each run on the data set
Weak composition: A₁ ∘ ⋯ ∘ A_k is (kε, kδ)-diff. private
Strong composition ≈ : A₁ ∘ ⋯ ∘ A_k is (√k · ε, kδ)-diff. private
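The two statements can be turned into small accounting helpers. The strong-composition expression below spells out the standard [DRV10] bound with an extra slack parameter δ′, which is more detail than the slide's ≈ statement carries; treat it as a sketch of that bound.

```python
import math

def weak_composition(eps, delta, k):
    # k-fold basic composition: the epsilons and deltas simply add up
    return k * eps, k * delta

def strong_composition(eps, delta, k, delta_prime):
    # advanced composition [DRV10]: roughly sqrt(k)*eps, at the cost of delta_prime
    eps_total = (math.sqrt(2 * k * math.log(1.0 / delta_prime)) * eps
                 + k * eps * (math.exp(eps) - 1.0))
    return eps_total, k * delta + delta_prime
```

For many executions of a low-ε mechanism, the √k scaling is the difference between a useless and a useful total privacy budget.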
Recap: (Private) convex risk minimization
Convex Empirical Risk Minimization (ERM): An Example
Linear classifiers in ℝ^p
Domain: feature vector x ∈ ℝ^p, label y ∈ {yellow, red} = {+1, −1}, with (x, y) ∼ i.i.d. from a distribution 𝒟
Data set D = {(x₁, y₁), ⋯, (x_n, y_n)}
ERM: Use L(θ; D) = (1/n) Σ_{i=1}^n ℓ(y_i ⟨x_i, θ⟩) to approximate the true risk
Find θ ∈ C that classifies (x, y) ∼ 𝒟, where C is a convex set of constant diameter
Minimize risk: θ* = arg min_{θ∈C} E_{(x,y)∼𝒟}[ℓ(y ⟨x, θ⟩)]
Empirical Risk Minimization (ERM) Setup
Convex loss function: ℓ: C × D → ℝ
Data set D = {d₁, ⋯, d_n}
Loss function: L(θ; D) = (1/n) Σ_{i=1}^n ℓ(θ; d_i)
Regularized ERM: θ̂ = arg min_{θ∈C} (1/n) Σ_{i=1}^n ℓ(θ; d_i) + r(θ), where the regularizer r(θ) is used to stop overfitting
Objective: Minimize the excess risk E_{d∼𝒟}[ℓ(θ̂; d)] − min_{θ∈C} E_{d∼𝒟}[ℓ(θ; d)]
Empirical Risk Minimization (ERM) Setup
Convex loss function: ℓ: C × D → ℝ
Data set D = {d₁, ⋯, d_n}
Loss function: L(θ; D) = (1/n) Σ_{i=1}^n ℓ(θ; d_i)
Empirical Risk Minimizer: θ̂ = arg min_{θ∈C} (1/n) Σ_{i=1}^n ℓ(θ; d_i)
Recall: Differential privacy + low excess empirical risk ⇒ low excess true risk
Today: Differentially private algorithm to output θ_priv that minimizes
L(θ_priv; D) − min_{θ∈C} L(θ; D)
Empirical Risk Minimization (ERM) Setup
Today: Differentially private algorithm to output θ_priv that minimizes
L(θ_priv; D) − min_{θ∈C} L(θ; D)
• ℓ₂/ℓ₂-setting: ||∇ℓ(θ; d)||₂ ≤ 1 and ||C||₂ ≤ 1
• ℓ₁/ℓ∞-setting: ||∇ℓ(θ; d)||∞ ≤ 1 and ||C||₁ ≤ 1
• Algorithms: Frank-Wolfe and LASSO
Part II of this Talk
1. Recap: Differential privacy and convex risk minimization
2. Private gradient descent and risk minimization in low-dimensions
3. Private Frank-Wolfe and risk minimization in high-dimensions
4. Private feature selection in high-dimensions and the LASSO
Private ERM and noisy gradient descent
(for the ℓ₂/ℓ₂-setting) [Bassily, Smith, T.'14]
Convex empirical risk minimization: ℓ₂/ℓ₂-setting
Δ-strong convexity: ∀θ₁, θ₂ ∈ C and t ∈ [0, 1],
ℓ(tθ₁ + (1 − t)θ₂; d) ≤ t·ℓ(θ₁; d) + (1 − t)·ℓ(θ₂; d) − (Δ/2)·t(1 − t)·||θ₁ − θ₂||₂²
Lipschitz continuity: ∀θ₁, θ₂ ∈ C, |ℓ(θ₁; d) − ℓ(θ₂; d)| ≤ ||θ₁ − θ₂||₂
Bounded set C: assume ||C||₂ is bounded by a constant
Why is privacy a concern in ERM?
Computing the median
Data set D = {d₁, ⋯, d_n}, where each d_i ∈ ℝ
Median: θ* = arg min_{θ∈ℝ} Σ_{i=1}^n |θ − d_i|
The median is a data point in D itself
[Figure: data points d₁, d₂, ⋯, d_n on a line, with θ* coinciding with one of them]
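The claim above is easy to check numerically: the minimizer of Σᵢ|θ − dᵢ| is always one of the data points, so releasing the exact ERM solution releases a raw record. A tiny sketch (function name hypothetical):

```python
import numpy as np

# The median minimizes L(theta; D) = sum_i |theta - d_i|, and the minimizer
# is itself one of the data points -- which is why releasing it verbatim
# is a privacy concern.
def erm_median(data):
    data = np.sort(np.asarray(data, dtype=float))
    losses = [np.abs(t - data).sum() for t in data]  # empirical risk at each point
    return float(data[int(np.argmin(losses))])

D = [3.0, 1.0, 4.0, 1.5, 9.0]
theta_star = erm_median(D)   # literally a point of D
```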
Why is privacy a concern in ERM?
Support vector machine: dual formulation
[Figure: a separating hyperplane with the support vectors highlighted]
Support vectors are essentially data points in D
Our results (data set size = n, dimension of set C is p < n)

Lipschitz:
• ε-DP: excess empirical risk O(p/(nε)); technique: exponential sampling (based on [MT07])
• (ε, δ)-DP: excess empirical risk O(√p · log²(n/δ)/(nε)); technique: formal analysis and improvements over [WM10] and [CSS13]

Δ-strongly convex:
• ε-DP: excess empirical risk O(p²/(n²ε²Δ)); technique: "localization" + exponential sampling
• (ε, δ)-DP: excess empirical risk O(p · log^2.5(n/δ)/(n²ε²Δ))
Our results (data set size = n, dimension of set C is p < n)
Focus: Lipschitz, (ε, δ)-DP
Excess empirical risk: O(√p · log²(n/δ)/(nε))
Technique: formal analysis and improvements over [WM10] and [CSS13]
Noisy stochastic gradient descent (Priv-SGD)
Loss functions ℓ(θ; d₁), ℓ(θ; d₂), ⋯, ℓ(θ; d_n); convex set C
1. Choose an arbitrary θ₁ ∈ C
2. For each time step t ∈ [T]:
a. Sample i ∼ uniform{1, ⋯, n}
b. θ̃ ← θ_t − η_t · (∇ℓ(θ_t; d_i) + b_t), where b_t ∼ N(0, I_p · σ²(ε, δ)) and the learning rate is η_t = O(1/√t)
3. Project θ̃ onto C, i.e., θ_{t+1} ← Π_C(θ̃)
Finally, output θ_T
Privacy guarantee: For T = n², Priv-SGD is (ε, δ)-differentially private
Key insights [denote # of iterations by T = n², and ignore log factors]:
1. Providing ≈ ε/√T differential privacy per iteration is sufficient (strong composition)
2. [KRSU10] Sampling a random data point per iteration amplifies the privacy of the update θ̃ ← θ_t − η_t · (∇ℓ(θ_t; d_i) + b_t) by a factor of ≈ 1/n
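The loop above can be sketched for a concrete loss. This toy instantiation uses the squared loss ℓ(θ; (x, y)) = ½(⟨x, θ⟩ − y)² and leaves the noise scale σ as a free parameter rather than the exact (ε, δ) calibration from the talk; the function name is hypothetical.

```python
import numpy as np

def priv_sgd(data_x, data_y, sigma, T, radius=1.0, rng=None):
    """Noisy projected SGD over the l2-ball of the given radius (sketch)."""
    rng = np.random.default_rng() if rng is None else rng
    n, p = data_x.shape
    theta = np.zeros(p)                       # step 1: arbitrary theta_1 in C
    for t in range(1, T + 1):
        i = rng.integers(n)                   # step 2a: sample one data point
        grad = (data_x[i] @ theta - data_y[i]) * data_x[i]   # gradient of the loss
        noise = rng.normal(0.0, sigma, size=p)
        eta = 1.0 / np.sqrt(t)                # learning rate ~ 1/sqrt(t)
        theta = theta - eta * (grad + noise)  # step 2b: noisy gradient step
        norm = np.linalg.norm(theta)          # step 3: project onto C
        if norm > radius:
            theta *= radius / norm
    return theta
```

Setting σ = 0 recovers plain projected SGD, which is a useful sanity check before turning the noise on.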
Utility guarantee (for T = n²):
E[L(θ_T; D)] − min_{θ∈C} L(θ; D) = O(√p · log²(n/δ)/(nε))
Key insights:
1. Unbiased estimator of the gradient: E_{i,b}[n · (∇ℓ(θ_t; d_i) + b_t)] = n · ∇L(θ_t; D)
2. Bounded variance: E_{i,b}[||n · (∇ℓ(θ_t; d_i) + b_t)||₂²] = O(n² · p · log²(n/δ)/ε²)
The utility guarantee now follows from [SZ'13]
Running time: Assuming computing one gradient ∇ℓ takes time O(p),
the running time of Priv-SGD is O(p·n²)
Note: If we computed the true gradient of L, instead of sampling, the running time would have been O(p·n³)
The utility guarantee of Priv-SGD is optimal.
• Proof via fingerprinting codes [BS96, Tardos03]
Part II of this Talk
1. Recap: Differential privacy and convex risk minimization
2. Private gradient descent and risk minimization in low-dimensions
3. Private Frank-Wolfe and risk minimization in high-dimensions
4. Private feature selection in high-dimensions and the LASSO
Private ERM and the Frank-Wolfe
(for the ℓ₁/ℓ∞-setting) [Talwar, T., Zhang]
Empirical Risk Minimization (ERM) Setup
Today: Differentially private algorithm to output θ_priv that minimizes
L(θ_priv; D) − min_{θ∈C} L(θ; D)
• ℓ₂/ℓ₂-setting: ||∇ℓ(θ; d)||₂ ≤ 1 and ||C||₂ ≤ 1
• ℓ₁/ℓ∞-setting: ||∇ℓ(θ; d)||∞ ≤ 1 and ||C||₁ ≤ 1 (commonly referred to as the high-dimensional setting)
• Algorithms: Frank-Wolfe and LASSO
Frank-Wolfe on the ℓ₁-ball (a stylized exposition)
1. Pick a corner of the ℓ₁-ball (θ₁ ∈ ℝ^p)
2. θ̃ ← arg min_{θ∈ℓ₁-ball} ⟨∇L(θ_t; D), θ⟩ (minimize the linearized loss)
3. θ_{t+1} ← μ·θ̃ + (1 − μ)·θ_t, typically with μ ≈ 1/t
Properties:
• θ̃ is always a corner
• The final output θ_T is always a convex combination of the corners
• For smooth losses, the convergence rate is ≈ O(1/T) [FW56, Jaggi'13]
Differentially private Frank-Wolfe
Recap: Report-Noisy-Max (a.k.a. Exponential Mechanism)
Objective: Output θ ∈ Θ maximizing q(θ, D)
Global sensitivity: GS(q) = max over θ ∈ Θ and all neighbors D, D′ of |q(θ, D) − q(θ, D′)|
1. ∀θ ∈ Θ: v_θ ← q(θ, D) + Lap(GS(q)/ε)
2. Output the θ with the highest value of v_θ
Theorem (Privacy): The algorithm is 2ε-differentially private
Private Frank-Wolfe (hiding log terms)
1. Pick a corner of the ℓ₁-ball (θ₁ ∈ ℝ^p)
2. θ̃ ← arg min_{θ ∈ corners of the ℓ₁-ball} ⟨∇L(θ_t; D) + b_t, θ⟩, with the coordinates of b_t ∼ Lap(√T/(nε))
(report-noisy-max + strong composition)
3. θ_{t+1} ← θ̃/t + (1 − 1/t)·θ_t
Theorem (privacy): The algorithm is (ε, δ)-differentially private
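The loop above can be sketched directly: the linear minimization over the ℓ₁-ball is attained at one of the 2p signed corners ±e_j, so step 2 reduces to a report-noisy-max over those corners. Here `lap_scale` stands in for the √T/(nε)-style calibration, and `grad_fn` is a hypothetical callback returning ∇L(θ; D); both names are illustrative.

```python
import numpy as np

def private_frank_wolfe(grad_fn, p, T, lap_scale, rng=None):
    """Frank-Wolfe over the l1-ball with noisy corner selection (sketch)."""
    rng = np.random.default_rng() if rng is None else rng
    theta = np.zeros(p)
    theta[0] = 1.0                      # step 1: start at a corner of the l1-ball
    for t in range(1, T + 1):
        g = grad_fn(theta)
        # inner products <g, corner> for corners +e_j (first p) and -e_j (last p),
        # each perturbed with Laplace noise: a report-noisy-max over 2p candidates
        scores = np.concatenate([g, -g]) + rng.laplace(0.0, lap_scale, 2 * p)
        j = int(np.argmin(scores))      # noisy arg min of the linearized loss
        corner = np.zeros(p)
        corner[j % p] = 1.0 if j < p else -1.0
        mu = 1.0 / t                    # step 3: convex averaging step
        theta = mu * corner + (1.0 - mu) * theta
    return theta
```

Because each iterate is a convex combination of corners, the output stays inside the ℓ₁-ball by construction.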
Private Frank-Wolfe (hiding log terms)
Theorem (utility): For T ≈ n^{2/3},
L(θ_T; D) − min_{θ∈ℓ₁-ball} L(θ; D) = O(log p/(nε)^{2/3})
The guarantee is tight; optimality via fingerprinting codes [BS96, Tardos03]
The guarantee is meaningful even when p ≫ n
Part II of this Talk
1. Recap: Differential privacy and convex risk minimization
2. Private gradient descent and risk minimization in low-dimensions
3. Private Frank-Wolfe and risk minimization in high-dimensions
4. Private feature selection in high-dimensions and the LASSO
Feature selection in high-dimensional regression
Sparse Linear Regression in High-dimensions (p ≫ n)
• Data set: D = {(x₁, y₁), ⋯, (x_n, y_n)} where x_i ∈ ℝ^p and y_i ∈ ℝ
• Assumption: Data generated by a noisy linear system y_i = ⟨x_i, θ*⟩ + w_i
(feature vector x_i, parameter vector θ* ∈ ℝ^{p×1}, field noise w_i)
Data normalization:
• ∀i ∈ [n], ||x_i||∞ ≤ 1
• ∀i, w_i is sub-Gaussian
Sparse Linear Regression in High-dimensions (p ≫ n)
• Data set: D = {(x₁, y₁), ⋯, (x_n, y_n)} where x_i ∈ ℝ^p and y_i ∈ ℝ
• Assumption: Data generated by a noisy linear system
y = X·θ* + w, with response vector y ∈ ℝ^{n×1}, design matrix X ∈ ℝ^{n×p}, parameter vector θ* ∈ ℝ^{p×1}, and field noise w ∈ ℝ^{n×1}
• Sparsity: θ* has s < n non-zero entries
• Bounded norm: every non-zero entry satisfies |θ*_j| ∈ (Φ, 1 − Φ) for an arbitrarily small constant Φ
Sparse Linear Regression in High-dimensions (p ≫ n)
y = X·θ* + w (design matrix X, parameter vector θ*, field noise w, response vector y)
Model selection problem: Find the non-zero coordinates of θ*
This talk: with differential privacy
Sparse Linear Regression in High-dimensions (p ≫ n)
y = X·θ* + w
Model selection: Find the non-zero coordinates (the support) of θ*
Solution: LASSO estimator [Tibshirani94, EFJT03, Wainwright06, CT07, ZY07, …]
θ̂ ∈ arg min_{θ∈ℝ^p} (1/2n)·||y − X·θ||₂² + Λ·||θ||₁
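The estimator above can be computed with a short proximal-gradient (ISTA) loop. This is a generic, non-private solver written for illustration, not the algorithm of the talk; the sparse-recovery example at the bottom is likewise a made-up instance.

```python
import numpy as np

def lasso_ista(X, y, lam, iters=500):
    """Minimize (1/2n)||y - X theta||_2^2 + lam*||theta||_1 via ISTA."""
    n, p = X.shape
    step = 1.0 / (np.linalg.norm(X, 2) ** 2 / n)   # 1/L for the quadratic part
    theta = np.zeros(p)
    for _ in range(iters):
        grad = -X.T @ (y - X @ theta) / n          # gradient of the least-squares part
        z = theta - step * grad
        # soft-thresholding: the proximal operator of lam*||.||_1
        theta = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)
    return theta

# Sparse recovery example: theta* supported on the first 2 of 20 coordinates.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
theta_star = np.zeros(20)
theta_star[:2] = [1.0, -1.0]
y = X @ theta_star + 0.1 * rng.normal(size=200)
theta_hat = lasso_ista(X, y, lam=0.1)
```

Note the soft-thresholding step produces exact zeros, which is why the LASSO is usable for support recovery at all.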
Consistency of the LASSO Estimator
y = X·θ* + w
Consistency conditions* [Wainwright06, ZY07]:
• Γ: Support of the underlying parameter vector θ*
• Incoherence: |||X_{Γᶜ}ᵀ X_Γ (X_Γᵀ X_Γ)^{−1}|||_∞ < 1/4
• Restricted Strong Convexity: λ_min(X_Γᵀ X_Γ) = Ω(n)
Consistency of the LASSO Estimator
y = X·θ* + w
Consistency conditions* [Wainwright06, ZY07] (incoherence and restricted strong convexity), with Γ the support of θ*
Theorem*: Under a proper choice of Λ and n = Ω(s·log p), the support of the LASSO estimator θ̂ equals the support of θ*
Stochastic Consistency of the LASSO
y = X·θ* + w
Consistency conditions* [Wainwright06, ZY07] (incoherence and restricted strong convexity), with Γ the support of θ*
Theorem [Wainwright06, ZY07]: If each data entry of X is ∼ N(0, 1/4), then the assumptions above are satisfied w.h.p.
Notion of Neighboring Data sets
Data set D = (X, y): design matrix X and response vector y
Data set D′ = (X′, y′): D and D′ differ in a single (row, response) pair
D and D′ are neighboring data sets
Perturbation stability
(a.k.a. zero local sensitivity)
Aside: Local Sensitivity [NRS07]
Data set D = {d₁, ⋯, d_n}; let f: D* → ℝ^k be a function on D
Local sensitivity: LS(f, D, 1) = max over D′ differing from D in one element of ||f(D) − f(D′)||₁
1. b: random variable sampled from Lap(LS(f, D, 1)/ε)
2. Output f(D) + b
This is not differentially private!
Part II: We show that local sensitivity is a useful tool
Perturbation Stability
[Figure: a data set D = (d₁, d₂, ⋯, d_n) mapped through a function f to an output; changing one entry (D → D′) produces the same output]
Stability of f at D: The output does not change on changing any one entry
Equivalently, the local sensitivity of f at D is zero
Distance to Instability Property
• Definition: A function f: D* → ℝ is k-stable at a data set D if for any data set D′ ∈ D* differing from D in at most k elements, f(D) = f(D′)
• Distance to instability: the largest k such that f is k-stable at D
• Objective: Output f(D) while preserving differential privacy
[Figure: the space of all data sets, split into stable and unstable data sets, with D at distance > k from the unstable region]
Propose-Test-Release (PTR) framework
[DL09, KRSY11, Smith T.'13]
A Meta-algorithm: Propose-Test-Release (PTR)
Basic tool: the Laplace mechanism
1. dist ← the largest k such that f is k-stable at D
2. dist̂ ← dist + Lap(1/ε)
3. If dist̂ > log(1/δ)/ε, then return f(D), else return ⊥
Theorem: The algorithm is (ε, δ)-differentially private
Theorem: If f is (2·log(1/δ)/ε)-stable at D, then w.p. ≥ 1 − δ the algorithm outputs f(D)
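The three steps above fit in a few lines. The instantiation below (the mode of a list, with a stability radius derived from the gap between the top two counts) is a toy invented for illustration; the talk's actual instantiation is the LASSO support.

```python
import numpy as np

def ptr(D, f, dist_to_instability, eps, delta, rng=None):
    """Propose-Test-Release: release f(D) only if D is provably far from instability."""
    rng = np.random.default_rng() if rng is None else rng
    d_hat = dist_to_instability(D) + rng.laplace(0.0, 1.0 / eps)
    if d_hat > np.log(1.0 / delta) / eps:
        return f(D)        # safe: f is stable in a large neighborhood of D
    return None            # "bottom": refuse to answer

# Toy instantiation: f = mode of a list. Changing k entries can lower the top
# count by k and raise the runner-up by k, so the mode is k-stable whenever
# 2k is less than the gap between the top two counts.
def mode(D):
    vals, counts = np.unique(D, return_counts=True)
    return vals[np.argmax(counts)]

def mode_stability(D):
    _, counts = np.unique(D, return_counts=True)
    if len(counts) == 1:
        return len(D)
    c = np.sort(counts)
    return max((c[-1] - c[-2] - 1) // 2, 0)
```

The key privacy fact is that the distance to instability itself has global sensitivity one, so adding Lap(1/ε) to it is just the Laplace mechanism.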
Recap: Propose-Test-Release (PTR) Framework
Key point: dist is a query with global sensitivity one, so the Laplace mechanism applies.
1. dist ← the largest k such that f is k-stable at D
2. dist̂ ← dist + Lap(1/ε)
3. If dist̂ > log(1/δ)/ε, then return f(D), else return ⊥
Theorem: The algorithm is (ε, δ)-differentially private
Theorem: If f is (2·log(1/δ)/ε)-stable at D, then w.p. ≥ 1 − δ the algorithm outputs f(D)
Instantiation of PTR for the LASSO
y = X·θ* + w
LASSO: θ̂ ∈ arg min_{θ∈ℝ^p} (1/2n)·||y − X·θ||₂² + Λ·||θ||₁
• Set the function f(D) = support of θ̂
• Issue: For this f, the distance to instability might not be efficiently computable
From [Smith, T.'13]: consistency conditions ⇒ perturbation stability, via proxy conditions
This talk: consistency conditions ⇒ perturbation stability, via proxy conditions that are efficiently testable with privacy
Perturbation Stability of the LASSO
y = X·θ* + w
LASSO: θ̂(D) ∈ arg min_{θ∈ℝ^p} (1/2n)·||y − X·θ||₂² + Λ·||θ||₁
Theorem: The consistency conditions for the LASSO are sufficient for perturbation stability
Proof sketch:
1. Analyze the Karush-Kuhn-Tucker (KKT) optimality conditions at θ̂
2. Show that support(θ̂) is stable using a "dual certificate" on stable instances
Perturbation Stability of the LASSO
y = X·θ* + w
Proof sketch:
(Sub)gradient of the LASSO objective: ∂J_D(θ) = −(1/n)·Xᵀ(y − X·θ) + Λ·∂||θ||₁
[Figure: the LASSO objective on D is minimized at θ̂ with 0 ∈ ∂J_D(θ̂); on a neighboring D′ it is minimized at θ̂′ with 0 ∈ ∂J_{D′}(θ̂′)]
Perturbation Stability of the LASSO
y = X·θ* + w
Proof sketch:
(Sub)gradient of the LASSO objective: ∂J_D(θ) = −(1/n)·Xᵀ(y − X·θ) + Λ·∂||θ||₁
Argue using the optimality conditions of ∂J_D(θ̂) and ∂J_{D′}(θ̂′):
1. No zero coordinate of θ̂ becomes non-zero in θ̂′ (use the mutual incoherence condition)
2. No non-zero coordinate of θ̂ becomes zero in θ̂′ (use the restricted strong convexity condition)
Perturbation Stability Test for the LASSO
y = X·θ* + w
Γ: Support of θ̂; Γᶜ: complement of the support of θ̂
Test for the following (the real test is more complex):
• Restricted Strong Convexity (RSC): the minimum eigenvalue of X_Γᵀ X_Γ is Ω(n)
• Strong stability: the absolute coordinates of the gradient of the least-squared loss on Γᶜ are ≪ Λ
Geometry of the Stability of LASSO
y = X·θ* + w
Intuition: strong convexity ensures supp(θ̂) ⊆ supp(θ̂′)
1. Strong convexity ensures ||θ̂_Γ − θ̂′_Γ||∞ is small
2. If |θ̂_Γ(j)| is large for all j, then θ̂′_Γ(j) ≠ 0 for all j
3. The consistency conditions imply |θ̂_Γ(j)| is large for all j
[Figure: the LASSO objective along Γ, over dimensions 1 and 2 of Γ]
Geometry of the Stability of LASSO
y = X·θ* + w
Intuition: strong stability ensures that no zero coordinate of θ̂ becomes non-zero in θ̂′
[Figure: the LASSO objective along Γᶜ; the ℓ₁ term contributes slopes +Λ and −Λ over dimensions 1 and 2 of Γᶜ]
• For the minimizer θ̂ to move along Γᶜ, the perturbation to the gradient of the least-squared loss has to be large
Geometry of the Stability of LASSO
y = X·θ* + w
Write the gradient of the least-squared loss as −(1/n)·Xᵀ(y − X·θ̂) = g, split into g_Γ and g_{Γᶜ}
[Figure: the LASSO objective along Γᶜ, with slopes +Λ and −Λ over dimensions 1 and 2 of Γᶜ]
• Strong stability: |g_j| ≪ Λ for all j ∈ Γᶜ ⇒ every j ∈ Γᶜ has a sub-gradient of zero for LASSO(D′)
Making the Stability Test Private (Simplified)
y = X·θ* + w
Test for Restricted Strong Convexity: is f₁(D) > τ₁?
Test for strong stability: is f₂(D) > τ₂?
Issue: If f₁(D) > τ₁ and f₂(D) > τ₂ the test passes, but f₁ and f₂ have sensitivities Δ₁ and Δ₂
Our solution: proxy distance
dist(D) = max( min_{i∈{1,2}} (f_i(D) − τ_i)/Δ_i + 1, 0 )
• dist has global sensitivity of one
• dist(D) is large exactly when f₁ and f₂ are both large and insensitive
Private Model Selection with Optimal Sample Complexity
y = X·θ* + w
(nearly optimal sample complexity)
1. Compute dist = a function of f₁(D) and f₂(D) (the proxy distance)
2. d̂ ← dist + Lap(1/ε)
3. If d̂ > log(n/δ)/ε, then return support(θ̂), else return ⊥
Theorem: The algorithm is (ε, δ)-differentially private
Theorem: Under the consistency conditions, n log n > γ²·s³ and n = Ω(s·log p),
w.h.p. the support of θ* is output. Here γ = log(1/δ)/ε.
Thesis: Differential privacy ⇒ generalizability
Stable learning ⇒ differential privacy
Part I of this Talk
1. Towards a rigorous notion of statistical data privacy
2. Differential privacy: An overview
3. Generalization guarantee via differential privacy
Part II of this Talk
1. Recap: Differential privacy and convex risk minimization
2. Private gradient descent and risk minimization in low-dimensions
3. Private Frank-Wolfe and risk minimization in high-dimensions
4. Private feature selection in high-dimensions and the LASSO
Concluding Remarks
• Diff. privacy and robust (stable) machine learning are closely related
• Not in this talk:
• Private machine learning via bootstrapping [Smith T.’13]
• Private non-convex learning via bootstrapping [BDMRTW]
• False discovery rate control via differential privacy [DFHPR14]
Open Questions
• Develop the theory behind private non-convex learning
• Analyze algorithms like expectation maximization and alternating
minimization
• Private learning with time-series data (e.g., auto-regressive models)
• Private matrix completion (the Netflix problem) using Frank-Wolfe
```