PrivBayes: Private Data Release via Bayesian Networks

Jun Zhang, Graham Cormode, Cecilia M. Procopiuc, Divesh Srivastava, Xiaokui Xiao

Outline
◦ The Problem: Private Data Release (Differential Privacy; Challenges)
◦ The Algorithm: PrivBayes (Bayesian Network; Details of PrivBayes)
◦ Score Function F: Linear vs. Logarithmic
◦ Experiments

The Problem: Private Data Release

[Diagram: a company holds a sensitive database D. Instead of publishing D, it releases a synthetic database D* with similar statistical properties, so that an institute can still make accurate inferences from the public data while an adversary learns little about any individual.]

How can we design such a private data release algorithm?

Differential Privacy

Definition of ε-differential privacy
◦ A randomized data release algorithm A satisfies ε-differential privacy if, for any two neighboring datasets D, D′ (differing in one individual's tuple) and for any possible synthetic database D*,
  Pr[A(D) = D*] ≤ exp(ε) ⋅ Pr[A(D′) = D*].

Example of neighboring datasets, differing only in Frank's tuple:

  D:  Name    Has cancer?       D′: Name    Has cancer?
      Alice   Yes                   Alice   Yes
      Bob     No                    Bob     No
      Chris   Yes                   Chris   Yes
      Denise  Yes                   Denise  Yes
      Eric    No                    Eric    No
      Frank   Yes                   Frank   No

A general approach to achieving differential privacy is to inject Laplace noise into the output, in order to mask the impact of any individual. (More details in the Preliminaries part of the paper.)

Goal: design a data release algorithm with a differential privacy guarantee.

Challenges

To build synthetic data, we need to understand the tuple distribution Pr[D*] of the sensitive data.
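The Laplace-noise approach mentioned above can be sketched as follows (illustrative code, not from the paper; the histogram and ε value are made up):

```python
import numpy as np

def laplace_mechanism(true_counts, epsilon, sensitivity=1.0, rng=None):
    """Release counts with Laplace(sensitivity / epsilon) noise added.

    Adding or removing one individual's tuple changes each count by at
    most `sensitivity`, so the noisy counts satisfy epsilon-DP.
    """
    rng = rng or np.random.default_rng()
    scale = sensitivity / epsilon
    noise = rng.laplace(loc=0.0, scale=scale, size=len(true_counts))
    return np.asarray(true_counts, dtype=float) + noise

# Example: the "has cancer?" histogram (4 Yes, 2 No) from the slide.
noisy = laplace_mechanism([4, 2], epsilon=0.1, rng=np.random.default_rng(0))
```

Smaller ε means a larger noise scale, hence stronger privacy but less accurate counts.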
Naive approach: convert the sensitive database D, with noise injected, into the full-dimensional tuple distribution; then sample the noisy distribution to obtain the synthetic database D*.

Example: the database has 10M tuples and 10 attributes (dimensions), with 20 values per attribute.
◦ Scalability: the full distribution Pr[D*] has 20^10 ≈ 10^13 cells, most of which have non-zero counts after noise injection, so privacy is expensive (computation, storage).
◦ Signal-to-noise: the average information in each cell is 10^7 / 10^13 = 10^-6, while the average noise is about 10 (for ε = 0.1); the noise is roughly 10^7 times the signal.

Previous solutions suffer from either the scalability or the signal-to-noise problem.

The PrivBayes approach: approximate the full-dimensional distribution by a set of low-dimensional distributions; convert the sensitive database, with noise injected, into noisy low-dimensional distributions; then sample from them to obtain D*.

The advantages of using low-dimensional distributions:
◦ easy to compute
◦ small domain → high signal density → robust against noise

But how do we find a set of low-dimensional distributions that provides a good approximation to the full distribution?

Bayesian Network

Consider a 5-dimensional database with attributes age, workclass, income, education, and title. A Bayesian network assigns each attribute a small set of parent attributes and approximates the joint distribution by a product of low-dimensional conditional distributions. For instance (the parent sets here are illustrative):

◦ 1-degree network (each attribute has at most one parent):
  Pr[D*] ≈ Pr[age] ⋅ Pr[workclass | age] ⋅ Pr[education | age] ⋅ Pr[title | education] ⋅ Pr[income | workclass]
◦ 2-degree network (each attribute has at most two parents):
  Pr[D*] ≈ Pr[age] ⋅ Pr[workclass | age] ⋅ Pr[education | age, workclass] ⋅ Pr[title | age, education] ⋅ Pr[income | workclass, title]

The quality of the Bayesian network decides the quality of the approximation.
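Once a network and its low-dimensional distributions are fixed, generating synthetic tuples is simple ancestral sampling. A toy sketch with two attributes, a single hypothetical edge age → workclass, and made-up probabilities:

```python
import numpy as np

# Hypothetical low-dimensional distributions for the edge age -> workclass.
# Only these small tables are stored; the full joint is never materialized.
pr_age = {"young": 0.5, "old": 0.5}
pr_work_given_age = {
    "young": {"student": 0.7, "employee": 0.3},
    "old":   {"student": 0.1, "employee": 0.9},
}

def sample_tuple(rng):
    """Draw one synthetic tuple: first the root, then each child given its parent."""
    ages = list(pr_age)
    age = rng.choice(ages, p=[pr_age[a] for a in ages])
    works = list(pr_work_given_age[age])
    work = rng.choice(works, p=[pr_work_given_age[age][w] for w in works])
    return age, work

rng = np.random.default_rng(0)
synthetic = [sample_tuple(rng) for _ in range(1000)]
```

By construction the synthetic data reproduces the stored marginal Pr[age] and conditional Pr[workclass | age], which is exactly the approximation quality the network determines.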
Details of PrivBayes

STEP 1: Choose a suitable Bayesian network N
◦ must be done in a differentially private way
STEP 2: Compute the conditional distributions implied by N
◦ straightforward to do under differential privacy
◦ inject noise via the Laplace mechanism
STEP 3: Generate synthetic data by sampling from N
◦ post-processing: no privacy issues

Finding the optimal 1-degree Bayesian network was solved in [Chow-Liu'68]. The network is a DAG of maximum in-degree 1 that maximizes the sum of the mutual information of its edges,
  Σ_{(X,Y): edge} I(X, Y),  where  I(X, Y) = Σ_{x ∈ dom(X)} Σ_{y ∈ dom(Y)} Pr[x, y] log( Pr[x, y] / (Pr[x] ⋅ Pr[y]) ).
Equivalently: find the maximum spanning tree over the attributes, where the weight of edge (X, Y) is the mutual information I(X, Y).

Example: build a 1-degree Bayesian network for a database with four binary attributes A, B, C, D:

  Name    A  B  C  D
  Alan    0  0  0  0
  Bob     0  0  0  0
  Cykie   1  1  1  0
  David   0  0  0  0
  Eric    1  1  0  0
  Frank   1  1  0  0
  George  0  0  0  0
  Helen   1  1  1  0
  Ivan    0  0  0  0
  Jack    1  1  0  0

◦ Start from a random attribute, say A.
◦ Select the next tree edge by its mutual information. Candidates: A→B, A→C, A→D. Since B is always equal to A, I(A, B) is the largest; add edge A→B.
◦ New candidates: A→C, B→C, A→D, B→D. The edges into C carry the most mutual information; add one of them, say A→C.
◦ New candidates: A→D, B→D, C→D. D is constant, so all of these have mutual information 0; add any of them, say A→D. DONE!

It is NP-hard to train the optimal k-degree Bayesian network when k > 1 [JMLR'04], and most approximation algorithms are too complicated to be converted into private algorithms. In our paper, we find a way to extend the Chow-Liu solution (1-degree) to higher-degree cases; in this talk, we focus on the 1-degree case for simplicity.

Next: do it under differential privacy!
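Before adding privacy, the greedy (non-private) Chow-Liu construction above can be sketched as follows; the toy table is the A, B, C, D example, and the function names are illustrative:

```python
import numpy as np

def mutual_information(x, y):
    """Empirical mutual information I(X; Y) of two discrete columns (in nats)."""
    n = len(x)
    mi = 0.0
    for vx in set(x):
        px = sum(1 for a in x if a == vx) / n
        for vy in set(y):
            py = sum(1 for b in y if b == vy) / n
            pxy = sum(1 for a, b in zip(x, y) if (a, b) == (vx, vy)) / n
            if pxy > 0:
                mi += pxy * np.log(pxy / (px * py))
    return mi

def chow_liu_tree(data):
    """Grow a maximum spanning tree on mutual-information weights (Prim-style)."""
    attrs = list(data)
    in_tree, edges = [attrs[0]], []
    while len(in_tree) < len(attrs):
        u, v, _ = max(((u, v, mutual_information(data[u], data[v]))
                       for u in in_tree for v in attrs if v not in in_tree),
                      key=lambda e: e[2])
        edges.append((u, v))
        in_tree.append(v)
    return edges

data = {  # the toy database from the example above (Alan .. Jack)
    "A": [0, 0, 1, 0, 1, 1, 0, 1, 0, 1],
    "B": [0, 0, 1, 0, 1, 1, 0, 1, 0, 1],
    "C": [0, 0, 1, 0, 0, 0, 0, 1, 0, 0],
    "D": [0] * 10,
}
tree = chow_liu_tree(data)  # first edge is ("A", "B"): B copies A exactly
```

Because B copies A exactly, I(A, B) equals the entropy of A and dominates every other candidate edge, matching the walkthrough above.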
(Non-private) Chow-Liu: select the edge with maximum mutual information I.
(Private) I is data-sensitive, so the best edge is also data-sensitive.

Solution: randomized edge selection with the exponential mechanism.
◦ Define a score function q(D, ω) that measures how good edge ω is as the result of the selection, given database D.
◦ Return edge ω with probability
    Pr[ω] ∝ exp( (ε / (2Δ)) ⋅ q(D, ω) ),
  where the sensitivity Δ = max_{ω, D, D′} |q(D, ω) − q(D′, ω)| sets the noise scale and q(D, ω) carries the information.

Problem solved? NO.
◦ The natural choice is q(D, ω) = I(ω).
◦ We prove that the sensitivity of I is Δ = Θ((log n) / n), where n = |D| (Lemma 1).
◦ A sensitivity (noise scale) of (log n) / n is too large for the information scale of I.

Score Function F: Linear vs. Logarithmic

  Function   Range (scale of info)   Sensitivity (scale of noise)
  I          Θ(1)                    Θ((log n) / n)
  F          Θ(1)                    Θ(1 / n)

IDEA: define a score F that agrees with I at its maximum values and interpolates linearly in between, by asking how far Pr[X, Y] is from the nearest maximal distribution:

  F(X, Y) = −(1/2) ⋅ min_{Π ∈ maximal dbns} ‖ Pr[X, Y] − Π ‖_1,

where the maximal distributions are the "optimal" distributions over X, Y that maximize I(X, Y). The range of F is Θ(1) and its sensitivity is Θ(1/n).

Example (2 × 2 domain): for Pr[X, Y] = (0.5, 0.2; 0, 0.3), the nearest maximal distribution is Π = (0.5, 0; 0, 0.5), so ‖Pr[X, Y] − Π‖_1 = 0.4 and F = −0.2.

F and I have a strong positive correlation. [Scatter plot: F vs. I over random distributions; correlation coefficient = 0.9472.]

Experiments

We use four datasets in our experiments: Adult, NLTCS, TPC-E, BR2000.

Adult dataset:
◦ census data of 45,222 individuals
◦ 15 attributes: age, workclass, education, marital status, etc.
◦ tuple domain size (full-dimensional): about 2^52

[Result charts: error on all 2-way marginals and all 3-way marginals for each dataset, and accuracy when building 4 classifiers (e.g., Adult with target = gender and Adult with target = education), comparing PrivBayes against baselines.]

Summary

Differential privacy can be applied effectively for data release. Key ideas of the solution:
◦ Bayesian networks for dimension reduction
◦ a carefully designed linear quality function for the exponential mechanism

Many open problems remain:
◦ extend to other forms of data: graph data, mobility data
◦ obtain alternate (workable) privacy definitions

Thanks!

Backup: Related Work

Privacy, accuracy, and consistency too: a holistic solution to contingency table release [PODS'07]
◦ incurs an exponential running time
◦ only optimized for low-dimensional marginals
Differentially private publication of sparse data [ICDT'12]
◦ achieves scalability, but does not help with the signal-to-noise problem
Differentially private spatial decompositions [ICDE'12]
◦ coarsens the histogram H to control the number of cells
◦ has some limits, e.g., range queries, ordinal domains

Backup: Maximal Distributions (used in the definition of F)

Assume that |dom(X)| ≤ |dom(Y)|. A distribution Pr[X, Y] maximizes the mutual information between X and Y if and only if
◦ Pr[x] = 1/|dom(X)| for every x ∈ dom(X);
◦ for each y ∈ dom(Y), there is at most one x ∈ dom(X) with Pr[x, y] > 0.

Backup: Why the Sensitivities Differ

◦ Think of the two score functions as functions of a real argument x: one logarithmic, one linear.
◦ Neighboring databases correspond to arguments x and x + Δ.
◦ The sensitivity (noise) is governed by the maximum derivative: unbounded (∞) for the logarithmic score versus 1 for the linear one.

Backup: Query Answering vs. Data Release

Interactive query answering: the user submits a query; a differentially private algorithm accesses the database and returns a noisy answer, spending privacy budget on every query.
1. The risk of a privacy breach accumulates after answering multiple queries.
2. It requires a specific DP algorithm for every particular query.

Private data release: spend the privacy budget once to produce synthetic data; the user then runs queries directly on the synthetic data.
◦ Reusability: the sensitive data is accessed only once.
◦ Generality: supports most queries.
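As a final illustration, the private edge-selection step (the exponential mechanism with a linear score of sensitivity Θ(1/n)) can be sketched as follows; the candidate edges, score values, ε, and n below are all made up:

```python
import numpy as np

def exponential_mechanism(candidates, scores, epsilon, sensitivity, rng):
    """Pick one candidate with Pr proportional to exp(eps * q / (2 * delta)).

    With a linear score of sensitivity Theta(1/n), the exponent grows
    linearly in the database size n, so high-scoring edges dominate.
    """
    scores = np.asarray(scores, dtype=float)
    # Subtracting the max score only rescales the weights but avoids overflow.
    weights = np.exp(epsilon * (scores - scores.max()) / (2.0 * sensitivity))
    probs = weights / weights.sum()
    return candidates[rng.choice(len(candidates), p=probs)]

edges = [("A", "B"), ("A", "C"), ("A", "D")]
n = 10_000                      # hypothetical database size
picked = exponential_mechanism(edges, scores=[-0.1, -0.3, -0.5],
                               epsilon=0.1, sensitivity=1.0 / n,
                               rng=np.random.default_rng(0))
# With these settings the best edge ("A", "B") is selected almost surely.
```

Replacing the score values with mutual information and the sensitivity with Θ((log n)/n) in the same sketch shows the problem the linear score F solves: the exponent shrinks by a log n factor, flattening the selection probabilities.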