
Private Empirical Risk Minimization: Efficient Algorithms and Tight Error Bounds
Raef Bassily (Penn State), Adam Smith (Penn State; work done at BU/Harvard), Abhradeep Thakurta (Yahoo! Labs; work done at MSR/Stanford)

Privacy in Statistical Databases
A trusted curator holds a dataset x₁, x₂, ..., xₙ. Users — government, researchers, businesses, or a malicious adversary — send queries to an algorithm A and receive answers.
• Two conflicting goals: utility vs. privacy.
• Balancing these goals is tricky: there is no control over external sources of information, and anonymization is unreliable: [Narayanan-Shmatikov'08], [Korolova'11], [Calandrino et al.'12], ...

Differential Privacy [Dwork-McSherry-Nissim-Smith'06]
Datasets x and x' are called neighbors if they differ in one record.
Definition: A randomized algorithm A is (ε, δ)-differentially private if, for all neighboring datasets x and x' and all events S,
    Pr[A(x) ∈ S] ≤ e^ε · Pr[A(x') ∈ S] + δ.
The requirement is that neighboring datasets induce close distributions on outputs: "almost the same" conclusions will be reached whether or not any individual opts into or opts out of the data set. There are two regimes: pure ε-differential privacy (δ = 0) and (ε, δ)-differential privacy with δ > 0; think of δ as negligible, δ = negl(n).

This Work
Construct efficient and differentially private algorithms for convex empirical risk minimization with optimal excess risk.

Convex Empirical Risk Minimization
• Dataset x = (x₁, ..., xₙ) ∈ Uⁿ.
• Convex set C ⊆ ℝ^p.
• Loss function ℓ: C × U → ℝ, where ℓ(·; x) is convex for all x.
• Goal: find a "parameter" θ ∈ C that minimizes the empirical risk
      L(θ; x) = (1/n) Σᵢ₌₁ⁿ ℓ(θ; xᵢ),
  whose actual minimizer over C is denoted θ*.
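As a minimal numeric illustration of the empirical-risk setup (all specific choices below — the 1-D domain, the squared loss, the grid search — are assumptions made for the example, not the paper's):

```python
import numpy as np

# Toy instance: U = C = [0, 1] in R^1, with the convex loss
# l(theta; x) = (theta - x)^2 / 2.
def loss(theta, x):
    return 0.5 * (theta - x) ** 2

def empirical_risk(theta, data):
    # L(theta; x) = (1/n) * sum_i l(theta; x_i)
    return np.mean([loss(theta, xi) for xi in data])

data = [0.1, 0.4, 0.9]
grid = np.linspace(0.0, 1.0, 1001)        # crude search over C
theta_star = grid[np.argmin([empirical_risk(t, data) for t in grid])]
print(round(theta_star, 3))               # minimizer of the mean squared loss: the mean
```

For this loss the empirical-risk minimizer θ* is the sample mean, which the grid search recovers to grid precision.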
• Output θ̂ ∈ C such that the excess risk L(θ̂; x) − L(θ*; x) is small, where θ* is the actual minimizer.

Examples
• Median.
• Linear regression: L(θ; X) = (1/n) Σᵢ₌₁ⁿ (yᵢ − ⟨xᵢ, θ⟩)².
• Support vector machine, with the hinge loss h(z) = (1 − z)₊.

Private Convex ERM [Chaudhuri-Monteleoni'08, Chaudhuri-Monteleoni-Sarwate'11]
The algorithm A takes the loss ℓ, the convex set C, the dataset x = (x₁, ..., xₙ), and random coins, and outputs θ̂ ∈ C.
• Privacy: A is differentially private in the input x = (x₁, ..., xₙ).
• Utility: measured by the (worst-case) expected excess risk
      E[L(θ̂; x)] − min_{θ∈C} L(θ; x)
  (recall that L(θ; x) = (1/n) Σᵢ₌₁ⁿ ℓ(θ; xᵢ)).
• Studied by [Chaudhuri et al.'11, Rubinstein et al.'11, Kifer-Smith-Thakurta'12, Smith-Thakurta'13, ...].

Why Care About Privacy in ERM?
• Median: the exact minimizer is always a data point. For ℓ(θ; x₁) + ℓ(θ; x₂) + ℓ(θ; x₃) with x₁ < x₂ < x₃, the minimizer is θ* = x₂.
• Dual form of SVM: the minimizer typically contains a subset of the data points in the clear.

Contributions
1. New algorithms with optimal excess risk, assuming only that:
   • the loss function is Lipschitz, and
   • the parameter set C is bounded.
   (A separate set of algorithms handles strongly convex loss.)
2. Matching lower bounds.
The best previous work [Chaudhuri et al.'11, Kifer et al.'12] additionally assumes ℓ is smooth (bounded second derivative). Non-smooth loss is common: SVM, median, ... Applying their technique to the general case requires smoothing the loss, introducing extra error. This work improves the bounds by a factor of n.

Results – Upper Bounds (dataset size n, C ⊂ ℝ^p)
• Lipschitz loss, ε-DP: exponential sampling (inspired by [McSherry-Talwar'07]).
• Lipschitz loss, (ε, δ)-DP: noisy stochastic gradient descent (a rigorous analysis of, and improvements to, [McSherry-Williams'10], [Jain-Kothari-Thakurta'12], and [Chaudhuri-Sarwate-Song'13]).
• Λ-strongly convex Lipschitz loss, ε-DP: localization (a new technique).
• Λ-strongly convex Lipschitz loss, (ε, δ)-DP: noisy stochastic gradient descent (or localization).
Throughout, ℓ is 1-Lipschitz on a parameter set C of diameter ≤ 1.
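Returning to the median example above, a toy check that the exact empirical-risk minimizer is itself one of the records, so releasing it exactly reveals a record (the numbers are illustrative):

```python
import numpy as np

# Median via ERM: l(theta; x) = |theta - x| is 1-Lipschitz and convex.
data = np.array([3.0, 7.0, 42.0])

def risk(theta):
    # L(theta; x) = mean absolute deviation from the records
    return np.mean(np.abs(theta - data))

# Search a grid over a range containing the data; the middle record
# minimizes the summed absolute deviations.
candidates = np.linspace(0.0, 50.0, 501)
theta_hat = min(candidates, key=risk)
print(theta_hat)  # equals the middle record, 7.0
```

The output is exactly the record 7.0: for three points the absolute-deviation risk strictly decreases up to the middle point and strictly increases after it, so the unique minimizer is a data point.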
Results – Lower Bounds
• Lipschitz loss (ε-DP and (ε, δ)-DP): uses a linear loss ℓ(θ; x) = −⟨θ, x⟩ with C the unit ball, via a reduction from DP release of 1-way marginals [HT'10].
• Strongly convex loss (ε-DP and (ε, δ)-DP): uses a quadratic loss, via a reduction from DP release of 1-way marginals [BUV'13].
These lower bounds match the upper bounds above.
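The noisy stochastic gradient descent technique referenced in the results is developed in the next section; as a rough runnable sketch, here is a generic noisy projected SGD (the learning rate, the noise scale sigma, and the choice C = unit ball are placeholder assumptions, not the paper's calibrated values):

```python
import numpy as np

rng = np.random.default_rng(0)

def project_unit_ball(theta):
    # Euclidean projection onto C = {theta : ||theta|| <= 1}
    norm = np.linalg.norm(theta)
    return theta if norm <= 1.0 else theta / norm

def noisy_sgd(data, grad_loss, dim, sigma=0.1):
    """Noisy projected SGD sketch: each step samples one record uniformly,
    adds Gaussian noise b_t to its gradient, steps with rate eta_t, and
    projects back onto C. sigma and eta_t are illustrative choices; the
    paper calibrates them to the privacy parameters."""
    n = len(data)
    theta = np.zeros(dim)                  # arbitrary theta_1 in C
    for t in range(1, n * n + 1):          # n^2 iterations, as in the talk
        x = data[rng.integers(n)]          # fresh uniform sample
        eta = 1.0 / np.sqrt(t)             # decaying learning rate (assumption)
        b = sigma * rng.standard_normal(dim)
        theta = project_unit_ball(theta - eta * (grad_loss(theta, x) + b))
    return theta

# Usage: 1-D mean estimation with squared loss l(theta; x) = (theta - x)^2 / 2,
# whose gradient is theta - x.
data = [np.array([0.2]), np.array([0.4]), np.array([0.6])]
theta_hat = noisy_sgd(data, lambda th, x: th - x, dim=1)
print(theta_hat)  # noisy estimate; value depends on the random draws
```

The projection step guarantees every iterate, and hence the output, stays inside C regardless of the noise.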
Optimal Noisy Stochastic Gradient Descent Algorithm

Noisy stochastic gradient descent:
• Inputs: data x = (x₁, ..., xₙ), a 1-Lipschitz loss ℓ, and a convex set C.
• Choose an arbitrary starting point θ₁ ∈ C.
• For t = 1 to n²: at iteration t,
  1) sample x uniformly at random from {x₁, ..., xₙ} (a fresh data sample at each iteration);
  2) step in the direction of the noisy gradient, −ηₜ [∇ℓ(θₜ; x) + bₜ], where ηₜ is the learning rate and bₜ is the added noise, obtaining the next iterate θₜ₊₁ ∈ C.
• After n² iterations, output θ_{n²}.

Privacy of noisy SGD
The noisy SGD algorithm is (ε, δ)-differentially private.
• Key point: sampling amplifies privacy [KLNRS'08]. After T = n² iterations, by strong composition [DRV'10], privacy degrades from the per-iteration guarantee to the final (ε, δ) guarantee.

Optimal Exponential Sampling Algorithm

Exponential sampling:
• Define a probability distribution f(·; x) over C that concentrates on parameters with low empirical risk L(θ; x) — an instance of the exponential mechanism [McSherry-Talwar'07].
• Output a sample θ̂ from f(·; x).
Tight utility analysis via a "peeling" argument.
Efficient construction based on rapidly mixing MCMC:
• uses [Applegate-Kannan'91] as a subroutine;
• exploits the structure of convex functions;
• provides a purely multiplicative convergence guarantee;
• does not follow directly from existing results.
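Purely as a toy illustration of exponential sampling — not the paper's MCMC construction — one can sample directly over a discretized C (the discretization, the absolute-deviation loss, and the exponent's constant are all assumptions made for the sketch):

```python
import numpy as np

rng = np.random.default_rng(1)

def exponential_sample(data, grid, eps=1.0):
    """Toy exponential sampling over a discretized C, a subset of R:
    Pr[theta] is proportional to exp(-(eps * n / 2) * L(theta; x)),
    so low empirical risk is exponentially favored. The calibration of
    the exponent here is schematic, not the paper's exact constant."""
    data = np.asarray(data)
    n = len(data)
    # L(theta; x) with the 1-Lipschitz loss l(theta; x) = |theta - x|
    risks = np.array([np.mean(np.abs(t - data)) for t in grid])
    scores = -0.5 * eps * n * risks
    weights = np.exp(scores - scores.max())   # numerically stable
    return rng.choice(grid, p=weights / weights.sum())

grid = np.linspace(0.0, 1.0, 101)
theta_hat = exponential_sample([0.2, 0.5, 0.8], grid, eps=8.0)
print(theta_hat)  # a sample concentrated near the empirical minimizer
```

Larger eps concentrates the distribution more sharply around the minimizer, which is the trade-off the peeling argument quantifies.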
The peeling argument: consider nested sets A₁, A₂, A₃, A₄, ... around the minimizer θ*, at risk levels γ, 2γ, 3γ, 4γ, ..., which are decreasing in volume toward θ*; the analysis shows that Pr[θ̂ ∈ A₁] ≈ 1.

Summary
1. New algorithms with optimal excess risk, assuming:
   • the loss function is Lipschitz;
   • the parameter set C is bounded.
   (A separate set of algorithms handles strongly convex loss.)
2. Matching lower bounds.
Not in this talk:
• A new localization technique: an optimal algorithm for strongly convex loss.
• Generalization error guarantees: not known to be tight in general.