From Differential Privacy to Machine Learning, and Back
Abhradeep Guha Thakurta, Yahoo Labs, Sunnyvale

Thesis: Differential privacy ⇒ generalizability; stable learning ⇒ differential privacy

Part II of this Talk
1. Recap: Differential privacy and convex risk minimization
2. Private gradient descent and risk minimization in low-dimensions
3. Private Frank-Wolfe and risk minimization in high-dimensions
4. Private feature selection in high-dimensions and the LASSO

Recap: Differential privacy

Differential Privacy [DMNS06, DKMMN06]
• Adversary learns essentially the same thing irrespective of your presence or absence in the data set
• Data sets D and D′ that differ in one element are called neighboring data sets
• Require: neighboring data sets induce close distributions on the outputs A(D) and A(D′) (over the algorithm's random coins)

Definition: a randomized algorithm A is (ε, δ)-differentially private if
• for all neighboring data sets D and D′ (differing in one element), and
• for all sets of answers S:
  Pr[A(D) ∈ S] ≤ e^ε · Pr[A(D′) ∈ S] + δ

Semantics of Differential Privacy
• Differential privacy is a condition on the algorithm
• The guarantee is meaningful in the presence of any auxiliary information
• Typical privacy parameters: ε ≈ 0.1 and δ = 1/n^{log n}, where n = # of data samples
• Composition: ε's and δ's add up over multiple executions

Few tools to design differentially private algorithms

Laplace Mechanism [DMNS06]
Data set D = {d_1, ⋯, d_n} and f: D^* → ℝ a function on D
Global sensitivity: GS(f, 1) = max over neighboring D, D′ of ‖f(D) − f(D′)‖_1
1. b: random variable sampled from Lap(GS(f, 1)/ε)
2. Output f(D) + b
Theorem (privacy): the algorithm is ε-differentially private.

Gaussian Mechanism [DKMMN06]
Data set D = {d_1, ⋯, d_n} and f: D^* → ℝ a function on D
Global sensitivity: GS(f, 2) = max over neighboring D, D′ of ‖f(D) − f(D′)‖_2
1. b: random variable sampled from N(0, O(GS(f, 2)² · log(1/δ)/ε²))
2. Output f(D) + b
Theorem (privacy): the algorithm is (ε, δ)-differentially private.
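As a concrete reference point, the two mechanisms above can be sketched in a few lines of Python. This is a minimal illustration, not code from the talk; the Gaussian calibration σ = GS(f,2)·√(2 ln(1.25/δ))/ε is one standard choice of the hidden constants.

```python
import math
import random

def laplace_noise(scale):
    # A Laplace(0, scale) sample: difference of two i.i.d. exponentials.
    return random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)

def laplace_mechanism(data, f, gs1, eps):
    """Output f(data) + Lap(GS(f,1)/eps); eps-differentially private."""
    return f(data) + laplace_noise(gs1 / eps)

def gaussian_mechanism(data, f, gs2, eps, delta):
    """Output f(data) + N(0, sigma^2); (eps, delta)-differentially private
    for sigma = GS(f,2) * sqrt(2 ln(1.25/delta)) / eps (a standard calibration)."""
    sigma = gs2 * math.sqrt(2.0 * math.log(1.25 / delta)) / eps
    return f(data) + random.gauss(0.0, sigma)

# A counting query has global sensitivity 1: changing one record
# changes the count by at most 1.
data = [1, 0, 1, 1, 0, 1]
noisy_count = laplace_mechanism(data, sum, gs1=1.0, eps=0.5)
```

Note that the noise scale depends only on the sensitivity of f and the privacy parameters, never on the data itself.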
Report-Noisy-Max (a.k.a. Exponential Mechanism) [MT07, BLST10]
Set of candidate outputs C; score function q: C × U → ℝ, where U is the domain of data sets
Objective: output w ∈ C approximately maximizing q(w, D)
Global sensitivity: GS(q) = max |q(w, D) − q(w, D′)| over all w ∈ C and all neighbors D and D′
1. For each w ∈ C: s_w ← q(w, D) + Lap(GS(q)/ε)
2. Output the w with the highest value of s_w
Theorem (privacy): the algorithm is 2ε-differentially private.

Composition Theorems [DMNS06, DL09, DRV10]
Given k algorithms A_1, ⋯, A_k, each (ε, δ)-differentially private, run on the same data set D:
• Weak composition: A_1 ∘ ⋯ ∘ A_k is (kε, kδ)-differentially private
• Strong composition: A_1 ∘ ⋯ ∘ A_k is roughly (√(k log(1/δ)) · ε, kδ)-differentially private

Recap: (Private) convex risk minimization

Convex Empirical Risk Minimization (ERM): An Example
Linear classifiers in ℝ^p: feature vector x ∈ ℝ^p, label y ∈ {yellow, red} encoded as {+1, −1}
Data set D = {(x_1, y_1), ⋯, (x_n, y_n)}, drawn i.i.d. from a distribution τ over (x, y)
Goal: find θ in a convex set C of constant diameter that classifies (x, y) ∼ τ
Minimize risk: θ* = argmin_{θ∈C} E_{(x,y)∼τ} ℓ(⟨θ, x⟩; y)
ERM: use the empirical loss ℒ(θ; D) = (1/n) Σ_{i=1}^n ℓ(⟨θ, x_i⟩; y_i) to approximate the risk

Empirical Risk Minimization (ERM) Setup
Convex loss function ℓ: C × U → ℝ; data set D = {d_1, ⋯, d_n}; loss ℒ(θ; D) = (1/n) Σ_{i=1}^n ℓ(θ; d_i)
Regularized ERM: θ̂ = argmin_{θ∈C} (1/n) Σ_{i=1}^n ℓ(θ; d_i) + r(θ), where the regularizer r(θ) is used to stop overfitting
Objective: minimize the excess (true) risk E_{d∼τ} ℓ(θ̂; d) − min_{θ∈C} E_{d∼τ} ℓ(θ; d)
Recall from Part I: differential privacy + low excess empirical risk ⇒ low excess true risk
Today: differentially private algorithms to output θ_priv minimizing the excess empirical risk ℒ(θ_priv; D) − min_{θ∈C} ℒ(θ; D)
• ℓ2/ℓ2-setting: ‖∇ℓ(θ; d)‖_2 ≤ 1 and ‖C‖_2 ≤ 1. Algorithm: private gradient descent
• ℓ1/ℓ∞-setting: ‖∇ℓ(θ; d)‖_∞ ≤ 1 and ‖C‖_1 ≤ 1. Algorithms: Frank-Wolfe and LASSO
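Report-noisy-max, which reappears later inside private Frank-Wolfe, can be sketched as follows. This is illustrative Python, not from the talk; the example query "most frequent item" is hypothetical and has score sensitivity GS(q) = 1.

```python
import random

def lap(scale):
    # Laplace(0, scale) as a difference of two i.i.d. exponentials.
    return random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)

def report_noisy_max(candidates, q, data, gs_q, eps):
    """Add Lap(GS(q)/eps) to every candidate's score and release only the
    arg-max. As on the slide, this is 2*eps-differentially private."""
    return max(candidates, key=lambda w: q(w, data) + lap(gs_q / eps))

# Hypothetical use: privately select the most frequent item. Changing one
# record moves each count by at most 1, so GS(q) = 1.
votes = ["a", "a", "a", "b", "c", "a", "b", "a"]
winner = report_noisy_max(["a", "b", "c"], lambda w, d: d.count(w), votes,
                          gs_q=1.0, eps=2.0)
```

Only the identity of the winner is released; the noisy scores themselves stay hidden, which is what makes the 2ε accounting work.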
Next: 2. Private gradient descent and risk minimization in low-dimensions

Private ERM and noisy gradient descent (for the ℓ2/ℓ2-setting) [Bassily, Smith, T.'14]

Convex empirical risk minimization, ℓ2/ℓ2-setting:
• Lipschitz continuity: ∀θ_1, θ_2 ∈ C, |ℓ(θ_1; d) − ℓ(θ_2; d)| ≤ ‖θ_1 − θ_2‖_2
• Δ-strong convexity: ℓ(tθ_1 + (1 − t)θ_2; d) ≤ t·ℓ(θ_1; d) + (1 − t)·ℓ(θ_2; d) − (Δ/2)·t(1 − t)·‖θ_1 − θ_2‖_2² for all t ∈ [0, 1]
• Bounded set C: assume ‖C‖_2 is bounded by a constant

Why privacy is a concern in ERM? Computing the median
Data set D = {d_1, ⋯, d_n}, where each d_i ∈ ℝ
Median: θ* = argmin_{θ∈ℝ} Σ_{i=1}^n |θ − d_i|
The median is a data point in D.

Why privacy is a concern in ERM? Support vector machine (dual formulation)
The separating hyperplane is determined by the support vectors, and the support vectors are essentially data points in D.

Our results (data set size n, dimension of C is p < n):

Loss class | Privacy | Excess empirical risk | Technique
Lipschitz | ε-DP | O(p/(nε)) | exponential sampling (based on [MT07])
Lipschitz | (ε, δ)-DP | O(√p · log²(n/δ)/(nε)) | stochastic gradient descent (formal analysis and improvements over [WM10] and [CSS13])
Lipschitz + Δ-strongly convex | ε-DP | O(p²/(Δ n² ε²)) | "localization" + exponential sampling
Lipschitz + Δ-strongly convex | (ε, δ)-DP | O(p · log^{2.5}(n/δ)/(Δ n² ε²)) | stochastic gradient descent
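The "computing the median" example above can be made concrete in two lines: for odd n, the empirical risk minimizer of Σ_i |θ − d_i| is literally the middle data point, so releasing the exact ERM solution publishes one record verbatim. A minimal illustration (not from the talk; the salary figures are made up):

```python
def erm_median(data):
    """Minimizer of the empirical risk sum_i |theta - d_i| for odd n:
    exactly the middle order statistic, i.e. an actual data point."""
    s = sorted(data)
    return s[len(s) // 2]

salaries = [31_000.0, 415_000.0, 27_000.0, 182_000.0, 99_000.0]
theta_star = erm_median(salaries)
assert theta_star in salaries  # the released "model" is someone's exact record
```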
This talk focuses on the Lipschitz, (ε, δ)-DP row: excess empirical risk O(√p · log²(n/δ)/(nε)) via stochastic gradient descent (formal analysis and improvements over [WM10] and [CSS13]).

Private stochastic gradient descent (Priv-SGD)
Input: loss functions ℓ(θ; d_1), ⋯, ℓ(θ; d_n) over the convex set C; learning rate η_t = Θ(1/√t)
1. Choose an arbitrary θ_1 ∈ C
2. For each time step t ∈ [T]:
   a. Sample i ∼ Uniform{1, ⋯, n}
   b. θ̃_{t+1} ← θ_t − η_t (∇ℓ(θ_t; d_i) + b_t), where b_t ∼ N(0, σ² · I_p) with σ² = O(T log(1/δ)/(n²ε²))
   c. Project back onto C: θ_{t+1} ← Π_C(θ̃_{t+1})
3. Finally, output the average iterate θ̄

Privacy guarantee: for T = n², Priv-SGD is (ε, δ)-differentially private.
Key insights [with T = n² iterations, and ignoring the Lipschitz constant]:
1. Providing roughly constant (≈ ε) differential privacy per iteration, with respect to the sampled record, is sufficient
2. [KRSU10] Privacy amplification by sampling: because each step touches a single random sample, each noisy update θ̃_{t+1} ← θ_t − η_t(∇ℓ(θ_t; d_i) + b_t) is ≈ ε/n-differentially private with respect to the whole data set; strong composition over the T = n² steps then yields (ε, δ)
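The steps above can be sketched in plain Python. This is a simplified sketch, not the paper's implementation: the noise calibration hides constants, `lip` and `diam` are assumed bounds on the Lipschitz constant and the diameter of C, and `project` is the caller-supplied projection Π_C.

```python
import math
import random

def priv_sgd(data, grad_loss, project, p, eps, delta, lip=1.0, diam=1.0):
    """Noisy projected SGD, a sketch of Priv-SGD [Bassily-Smith-T.'14]:
    T = n^2 steps, one sampled example per step, Gaussian noise on the
    gradient calibrated (up to constants) for (eps, delta)-DP overall."""
    n = len(data)
    T = n * n
    sigma = lip * math.sqrt(T * math.log(1.0 / delta)) / (n * eps)  # up to constants
    theta = [0.0] * p
    for t in range(1, T + 1):
        eta = diam / (lip * math.sqrt(t))      # learning rate ~ 1/sqrt(t)
        d = random.choice(data)                # sample i ~ Uniform{1..n}
        g = grad_loss(theta, d)
        theta = [x - eta * (gj + random.gauss(0.0, sigma))
                 for x, gj in zip(theta, g)]
        theta = project(theta)                 # project back onto C
    return theta

# Toy use: mean estimation with squared loss over the unit l2 ball.
def project_l2(theta, r=1.0):
    norm = math.sqrt(sum(x * x for x in theta))
    return theta if norm <= r else [r * x / norm for x in theta]

grad_sq = lambda th, d: [tj - dj for tj, dj in zip(th, d)]
```

With tiny n the noise dominates, as the O(√p/(nε)) bound predicts; the sketch is meant to show the control flow, not to be a tuned solver.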
Private stochastic gradient descent (Priv-SGD)
Utility guarantee (for T = n²): ℒ(θ̄; D) − min_{θ∈C} ℒ(θ; D) = O(√p · log²(n/δ)/(nε))
Key insights:
1. Unbiased estimator of the gradient: E_{i, b_t}[∇ℓ(θ_t; d_i) + b_t] = ∇ℒ(θ_t; D)
2. Bounded variance: E_{i, b_t} ‖∇ℓ(θ_t; d_i) + b_t‖_2² = O(p · log(n/δ)/ε²)
The utility guarantee now follows from the convergence analysis of SGD with unbiased, bounded-variance gradient estimates [SZ'13].

Running time: assuming computing one gradient ∇ℓ takes time O(p), the running time of Priv-SGD is O(p n²). Note: if we computed the true gradient of ℒ at every step, instead of sampling, the running time would have been O(p n³).
The utility guarantee of Priv-SGD is optimal; the proof is via fingerprinting codes [BS96, Tardos03].

Next: 3. Private Frank-Wolfe and risk minimization in high-dimensions

Private ERM and the Frank-Wolfe (for the ℓ1/ℓ∞-setting) [Talwar, T., Zhang]
Recall the goal: a differentially private algorithm that outputs θ_priv minimizing ℒ(θ_priv; D) − min_{θ∈C} ℒ(θ; D), now in the ℓ1/ℓ∞-setting: ‖∇ℓ(θ; d)‖_∞ ≤ 1 and ‖C‖_1 ≤ 1. This is commonly referred to as the high-dimensional setting.

Frank-Wolfe on the ℓ1-ball (a stylized exposition)
1. Pick a corner of the ℓ1-ball as θ_1 ∈ ℝ^p
2. w_t ← argmin over corners w of the ℓ1-ball of ⟨∇ℒ(θ_t; D), w⟩ (minimize the linearized loss)
3. θ_{t+1} ← μ_t w_t + (1 − μ_t) θ_t, with typically μ_t ≈ 1/t
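The non-private Frank-Wolfe step has a one-line linear subproblem: over the ℓ1-ball, argmin_{‖w‖₁≤1} ⟨g, w⟩ is solved exactly by a signed corner ±e_j at the largest-magnitude gradient coordinate. An illustrative sketch (the 2/(t+2) step size is the standard choice from [Jaggi'13], in place of the slide's μ_t ≈ 1/t):

```python
def frank_wolfe_l1(grad, p, T):
    """Frank-Wolfe on {theta : ||theta||_1 <= 1}: every iterate is a convex
    combination of corners, so the output stays in the ball automatically."""
    theta = [0.0] * p
    theta[0] = 1.0                             # start at a corner
    for t in range(1, T + 1):
        g = grad(theta)
        j = max(range(p), key=lambda i: abs(g[i]))
        w = [0.0] * p
        w[j] = -1.0 if g[j] > 0 else 1.0       # corner minimizing <g, w>
        mu = 2.0 / (t + 2.0)                   # standard step size ~ 1/t
        theta = [mu * wj + (1.0 - mu) * tj for wj, tj in zip(w, theta)]
    return theta

# Toy objective: f(theta) = ||theta - c||^2 with c inside the l1 ball,
# so the constrained optimum is c itself.
c = [0.6, -0.2, 0.0]
theta_hat = frank_wolfe_l1(lambda th: [2.0 * (tj - cj) for tj, cj in zip(th, c)],
                           p=3, T=2000)
```

No projection step is ever needed, which is exactly the property the private variant exploits: the only data-dependent choice per round is which corner wins.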
Frank-Wolfe on the ℓ1-ball (a stylized exposition)
• w_t is always a corner
• The final output is always a convex combination of the corners
• For smooth losses, the convergence rate is ≈ O(1/T) [FW56, Jaggi'13]

Differentially private Frank-Wolfe
Recall report-noisy-max (a.k.a. the exponential mechanism) [MT07, BLST10] from the recap: add Lap(GS(q)/ε) noise to every candidate's score, release only the arg-max, and pay 2ε per invocation.

Private Frank-Wolfe (hiding logarithmic and Lipschitz terms)
1. Pick a corner of the ℓ1-ball as θ_1 ∈ ℝ^p
2. w_t ← argmin over corners w of ⟨∇ℒ(θ_t; D) + b_t, w⟩, with the coordinates of b_t drawn from Lap(O(√(T log(1/δ))/(nε)))
3. θ_{t+1} ← w_t/T + (1 − 1/T) θ_t
Privacy follows from report-noisy-max + strong composition.
Theorem (privacy): the algorithm is (ε, δ)-differentially private.
Theorem (utility): for T ≈ n^{2/3},
ℒ(θ_T; D) − min over the ℓ1-ball of ℒ(θ; D) = Õ(log(np)/(nε)^{2/3})
The guarantee is tight; optimality is again via fingerprinting codes [BS96, Tardos03]. The guarantee is meaningful even when p ≫ n.
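Putting the pieces together — report-noisy-max for the corner selection, strong composition across rounds — private Frank-Wolfe can be sketched as below. Illustrative Python only: the Laplace scale hides the same constants and Lipschitz terms the slides hide.

```python
import math
import random

def private_frank_wolfe(data, grad_L, p, eps, delta):
    """Private Frank-Wolfe sketch [Talwar-T.-Zhang]: choose the corner of the
    l1 ball via report-noisy-max over gradient coordinates; the Laplace scale
    is set (up to constants) so strong composition over T rounds is (eps,delta)-DP."""
    n = len(data)
    T = max(1, round(n ** (2.0 / 3.0)))        # T ~ n^(2/3), per the utility theorem
    scale = math.sqrt(T * math.log(1.0 / delta)) / (n * eps)  # up to constants
    lap = lambda: random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)
    theta = [0.0] * p
    theta[0] = 1.0                             # start at a corner
    for _ in range(T):
        g = grad_L(theta, data)
        j = max(range(p), key=lambda i: abs(g[i]) + lap())  # noisy corner choice
        w = [0.0] * p
        w[j] = -1.0 if g[j] > 0 else 1.0
        theta = [wj / T + (1.0 - 1.0 / T) * tj for wj, tj in zip(w, theta)]
    return theta
```

Since every update is a convex combination of corners, the iterates never leave the ℓ1-ball, and the output has at most T + 1 non-zero coordinates — part of why the guarantee stays meaningful when p ≫ n.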
Next: 4. Private feature selection in high-dimensions and the LASSO

Feature selection in high-dimensional regression

Sparse Linear Regression in High-dimensions (p ≫ n)
• Data set: D = {(x_1, y_1), ⋯, (x_n, y_n)}, where each feature vector x_i ∈ ℝ^p and each response y_i ∈ ℝ
• Assumption: the data is generated by a noisy linear system y = Xθ* + w, where X ∈ ℝ^{n×p} is the design matrix (rows x_i), θ* ∈ ℝ^{p×1} is the parameter vector, w is the field noise, and y ∈ ℝ^{n×1} is the response vector
• Data normalization: ∀i ∈ [n], ‖x_i‖_∞ ≤ 1, and each noise entry w_i is sub-Gaussian
• Sparsity: θ* has s < n non-zero entries
• Bounded norm: every non-zero coordinate of θ* has magnitude bounded away from zero, within a constant range governed by an arbitrarily small constant Φ

Model selection problem: find the non-zero coordinates of θ*. This talk: with differential privacy.
Solution: the LASSO estimator [Tibshirani94, EFJT03, Wainwright06, CT07, ZY07, …]:
θ̂ ∈ argmin_{θ∈ℝ^p} (1/2n)‖y − Xθ‖_2² + Λ‖θ‖_1

Consistency of the LASSO Estimator
Consistency conditions* [Wainwright06, ZY07], where Γ is the support of the underlying parameter vector θ*:
• Incoherence: ‖X_{Γᶜ}ᵀ X_Γ (X_Γᵀ X_Γ)^{−1}‖_∞ < 1/4
• Restricted strong convexity (RSC): λ_min(X_Γᵀ X_Γ) = Ω(n)
Theorem*: under a proper choice of Λ and n = Ω(s log p), the support of the LASSO estimator equals the support of θ*.
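The LASSO estimator above can be computed by proximal gradient descent (ISTA): a gradient step on the least-squares part, then coordinate-wise soft-thresholding at Λ·(step size). A self-contained sketch in plain Python (small scale, synthetic data; `lr` must stay below the inverse smoothness of the quadratic part):

```python
import math
import random

def lasso_ista(X, y, lam, lr=1.0, steps=2000):
    """Minimize (1/2n)||y - X theta||_2^2 + lam * ||theta||_1 via ISTA."""
    n, p = len(X), len(X[0])
    theta = [0.0] * p
    for _ in range(steps):
        resid = [sum(X[i][j] * theta[j] for j in range(p)) - y[i] for i in range(n)]
        grad = [sum(X[i][j] * resid[i] for i in range(n)) / n for j in range(p)]
        z = [theta[j] - lr * grad[j] for j in range(p)]
        # prox of lr*lam*||.||_1: soft-threshold every coordinate
        theta = [math.copysign(max(abs(v) - lr * lam, 0.0), v) for v in z]
    return theta

# Toy sparse recovery: p = 5, support {0, 1}, Gaussian design N(0, 1/4),
# matching the stochastic-consistency setting sketched in the talk.
random.seed(0)
n, p, theta_star = 100, 5, [1.0, -1.0, 0.0, 0.0, 0.0]
X = [[random.gauss(0.0, 0.5) for _ in range(p)] for _ in range(n)]
y = [sum(X[i][j] * theta_star[j] for j in range(p)) + random.gauss(0.0, 0.01)
     for i in range(n)]
theta_hat = lasso_ista(X, y, lam=0.05)
```

Soft-thresholding sets small coordinates exactly to zero, which is why the LASSO performs model selection rather than mere shrinkage.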
Stochastic Consistency of the LASSO
Theorem [Wainwright06, ZY07]: if each entry of the design matrix X is drawn i.i.d. from N(0, 1/4), then the consistency conditions above are satisfied w.h.p.

Notion of Neighboring Data sets
Data set D = (X, y), the design matrix together with the response vector. A neighboring data set D′ = (X′, y′) differs from D in exactly one row of the design matrix and the corresponding response entry.

Perturbation stability (a.k.a. zero local sensitivity)

Aside: Local Sensitivity [NRS07]
Data set D = {d_1, ⋯, d_n} and f: D^* → ℝ a function on D
Local sensitivity: LS(f, D, 1) = max over D′ neighboring D of ‖f(D) − f(D′)‖_1
1. b: random variable sampled from Lap(LS(f, D, 1)/ε)
2. Output f(D) + b
This is NOT differentially private: the noise scale itself depends on the data. Still, Part II shows that local sensitivity is a useful tool.

Perturbation Stability
Stability of f at D: the output f(D) does not change when any one entry of D is changed. Equivalently, the local sensitivity of f at D is zero.

Distance to Instability Property
• Definition: a function f: D^* → ℝ is k-stable at a data set D if for every data set D′ ∈ D^* with |D Δ D′| ≤ k, f(D) = f(D′)
• Distance to instability: dist(D) = max{k : f is k-stable at D}
• Objective: output f(D) while preserving differential privacy
(Picture: among all data sets, the stable ones sit at distance > k from the unstable region.)

Propose-Test-Release (PTR) framework [DL09, KRSY11, Smith T.'13]
A meta-algorithm whose basic tool is the Laplace mechanism:
1. dist ← max{k : f is k-stable at D}
2. d̂ ← dist + Lap(1/ε)
3. If d̂ > log(1/δ)/ε, then return f(D), else return ⊥
Theorem: the algorithm is (ε, δ)-differentially private (dist has global sensitivity one).
Theorem: if f is (2 log(1/δ)/ε)-stable at D, then w.p. ≥ 1 − δ the algorithm outputs f(D).
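The three-line meta-algorithm above, sketched in Python. Illustrative only: the example function is the mode of a list, with a conservative distance-to-instability of ⌊(gap between the top two counts − 1)/2⌋.

```python
import math
import random
from collections import Counter

def propose_test_release(data, f, dist_to_instability, eps, delta):
    """PTR [DL09, Smith-T.'13]: release f(data) only if a noisy version of the
    distance to instability clears log(1/delta)/eps; else output None (bot).
    (eps, delta)-DP because the distance has global sensitivity one."""
    d_hat = dist_to_instability(data) + (
        random.expovariate(eps) - random.expovariate(eps))  # Lap(1/eps)
    return f(data) if d_hat > math.log(1.0 / delta) / eps else None

def mode(data):
    return Counter(data).most_common(1)[0][0]

def mode_stability(data):
    # Changing k records can move the top count down by k and the runner-up
    # up by k, so the mode is unchanged as long as 2k < (top - runner_up).
    counts = [c for _, c in Counter(data).most_common()]
    gap = counts[0] - (counts[1] if len(counts) > 1 else 0)
    return max((gap - 1) // 2, 0)

answer = propose_test_release(["a"] * 50 + ["b"] * 3, mode, mode_stability,
                              eps=1.0, delta=0.05)
```

On a far-from-stable input the test fails and the algorithm returns ⊥ (here `None`) almost always; privacy holds unconditionally, utility only on stable inputs.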
Instantiation of PTR for the LASSO
LASSO: θ̂ ∈ argmin_{θ∈ℝ^p} (1/2n)‖y − Xθ‖_2² + Λ‖θ‖_1
• Set the function f = support(θ̂)
• Issue: for f, the distance to instability might not be efficiently computable
From [Smith, T.'13]: consistency conditions ⇒ perturbation stability
This talk: consistency conditions ⇒ proxy conditions (efficiently testable with privacy) ⇒ perturbation stability

Perturbation Stability of the LASSO
Theorem: the consistency conditions on the LASSO are sufficient for perturbation stability.
Proof sketch:
1. Analyze the Karush-Kuhn-Tucker (KKT) optimality conditions at θ̂
2. Show that support(θ̂) is stable by constructing a "dual certificate" on stable instances
Writing J(θ) = (1/2n)‖y − Xθ‖_2² + Λ‖θ‖_1 for the LASSO objective on D, and J′ for the objective on a neighboring D′, the minimizers satisfy 0 ∈ ∂J(θ̂) and 0 ∈ ∂J′(θ̂′).
Argue using these optimality conditions:
1. No zero coordinate of θ̂ becomes non-zero in θ̂′ (uses the mutual incoherence condition)
2. No non-zero coordinate of θ̂ becomes zero in θ̂′ (uses the restricted strong convexity condition)

Perturbation Stability Test for the LASSO
Γ: support of θ̂; Γᶜ: complement of the support. Test for the following (the real test is more complex):
• Restricted strong convexity (RSC): the minimum eigenvalue of X_Γᵀ X_Γ is Ω(n)
• Strong stability: the coordinates of the gradient of the (normalized) least-squares loss on Γᶜ are ≪ Λ in absolute value

Geometry of the Stability of the LASSO
Intuition: strong convexity ensures supp(θ̂) ⊆ supp(θ̂′):
1. Strong convexity ensures ‖θ̂_Γ − θ̂′_Γ‖_∞ is small
2. If every |θ̂_{Γ,j}| is large, then every θ̂′_{Γ,j} is non-zero
3. The consistency conditions imply that every |θ̂_{Γ,j}| is large
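The KKT conditions in the proof sketch are easy to check numerically: on the support, the least-squares gradient must equal −Λ·sign(θ̂_j); off the support, its magnitude must stay below Λ, and the "strong stability" flavor asks for a strict margin below Λ. An illustrative checker (the orthogonal-design example is hypothetical, chosen so the exact LASSO solution is computable by hand):

```python
def lasso_kkt_check(X, y, theta, lam, margin=0.5, tol=1e-8):
    """Subgradient optimality for J = (1/2n)||y - X theta||^2 + lam||theta||_1:
    on the support, grad_j == -lam * sign(theta_j); off the support we ask for
    the stronger |grad_j| <= margin * lam (room for a dual certificate)."""
    n, p = len(X), len(X[0])
    resid = [sum(X[i][j] * theta[j] for j in range(p)) - y[i] for i in range(n)]
    grad = [sum(X[i][j] * resid[i] for i in range(n)) / n for j in range(p)]
    on_ok = all(abs(grad[j] + lam * (1.0 if theta[j] > 0 else -1.0)) <= tol
                for j in range(p) if theta[j] != 0.0)
    off_ok = all(abs(grad[j]) <= margin * lam for j in range(p) if theta[j] == 0.0)
    return on_ok, off_ok

# Hypothetical orthogonal design: the LASSO solution is exactly [0.8, 0.0]
# for lam = 0.1, so both conditions hold with room to spare.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
y = [1.0, 0.05, 1.0, 0.05]
ok_support, ok_margin = lasso_kkt_check(X, y, theta=[0.8, 0.0], lam=0.1)
```

The margin on Γᶜ is exactly what makes the support robust: a one-row perturbation cannot push an off-support gradient coordinate past Λ, so no new coordinate enters the support.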
Geometry of the Stability of the LASSO, along Γᶜ
Intuition: strong stability ensures that no zero coordinate of θ̂ becomes non-zero in θ̂′.
(Picture: the LASSO objective along each coordinate of Γᶜ has a kink at zero, with slopes +Λ and −Λ on either side.)
• For the minimizer to move along Γᶜ, the perturbation to the gradient of the least-squares loss would have to be large
• Gradient of the least-squares loss on Γᶜ: g_{Γᶜ} = (1/n) X_{Γᶜ}ᵀ(Xθ̂ − y)
• Strong stability: |g_j| ≪ Λ for all j ∈ Γᶜ ⇒ every j ∈ Γᶜ has a subgradient of zero for LASSO(D′)

Making the Stability Test Private (simplified)
Test statistics: v_1 for restricted strong convexity (the minimum eigenvalue), with threshold τ_1, and v_2 for strong stability, with threshold τ_2.
Issue: even when v_1 > τ_1 and v_2 > τ_2, the statistics themselves have sensitivities Δ_1 and Δ_2.
Our solution: a proxy distance d = max{min_i (v_i − τ_i)/Δ_i + 1, 0}
• d has global sensitivity one, and d is large exactly when v_1 and v_2 are both large and insensitive

Private Model Selection with Optimal Sample Complexity
1. Compute the proxy distance d, a function of v_1(D) and v_2(D)
2. d̂ ← d + Lap(1/ε)
3. If d̂ > log(1/δ)/ε, then return support(θ̂), else return ⊥
Theorem: the algorithm is (ε, δ)-differentially private.
Theorem: under the consistency conditions and a nearly optimal sample complexity n = Ω(s γ log p), where γ = log(1/δ)/ε, w.h.p. the support of θ* is output.

Thesis: Differential privacy ⇒ generalizability; stable learning ⇒ differential privacy
Part I of this talk: 1. Towards a rigorous notion of statistical data privacy; 2. Differential privacy: an overview; 3. Generalization guarantee via differential privacy; 4. Application: follow-the-perturbed-leader
Part II of this talk: 1. Recap: differential privacy and convex risk minimization; 2. Private gradient descent and risk minimization in low-dimensions; 3. Private Frank-Wolfe and risk minimization in high-dimensions; 4. Private feature selection in high-dimensions and the LASSO

Concluding Remarks
• Differential privacy and robust (stable) machine learning are closely related
• Not in this talk:
  • Private machine learning via bootstrapping [Smith T.'13]
  • Private non-convex learning via bootstrapping [BDMRTW]
  • False discovery rate control via differential privacy [DFHPR14]

Open Questions
• Develop the theory behind private non-convex learning
• Analyze algorithms like expectation maximization and alternating minimization
• Private learning with time-series data (e.g., auto-regressive models)
• Private matrix completion (the Netflix problem) using Frank-Wolfe