Slides - Jun Zhang

PrivBayes: Private Data Release via Bayesian Networks
Jun Zhang, Graham Cormode, Cecilia M. Procopiuc,
Divesh Srivastava, Xiaokui Xiao

The Problem: Private Data Release
◦ Differential Privacy
◦ Challenges

The Algorithm: PrivBayes
◦ Bayesian Network
◦ Details of PrivBayes

Function F: Linear vs. Logarithmic

Experiments

[Figure: a company or institute holds a sensitive database D. Releasing it to the public would expose individuals to the adversary. Instead, the company releases a synthetic database D* with similar properties, supporting accurate inference, while withstanding the adversary.]

How can we design such a private data release algorithm?


Definition of ε-Differential Privacy
◦ A randomized data release algorithm A satisfies ε-differential privacy if, for any two neighboring datasets D, D′ and for any possible synthetic data D*,
Pr[A(D) = D*] ≤ exp(ε) ⋅ Pr[A(D′) = D*]
Example: neighboring databases D and D′ differ in one tuple (Frank's):

D:                     D′:
Name    Has cancer?    Name    Has cancer?
Alice   Yes            Alice   Yes
Bob     No             Bob     No
Chris   Yes            Chris   Yes
Denise  Yes            Denise  Yes
Eric    No             Eric    No
Frank   Yes            Frank   No

A general approach to achieving differential privacy is to inject Laplace noise into the output, in order to mask the impact of any individual!

More details in the Preliminaries section of the paper.

Goal: design a data release algorithm with a differential privacy guarantee.
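As a minimal illustration of the Laplace mechanism mentioned above (a sketch, not the paper's code): for a counting query, adding or removing one individual changes the count by at most 1, so Laplace noise with scale 1/ε suffices.

```python
import numpy as np

def laplace_count(true_count, epsilon, sensitivity=1.0):
    """Release a count under epsilon-differential privacy by adding
    Laplace noise with scale sensitivity/epsilon.

    For a counting query ("how many people have cancer?"), one
    individual changes the count by at most 1, so sensitivity = 1.
    """
    noise = np.random.laplace(loc=0.0, scale=sensitivity / epsilon)
    return true_count + noise
```

Smaller ε means stronger privacy and larger noise.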


To build synthetic data, we need to understand the tuple distribution Pr[t] of the sensitive data:

sensitive database D → (convert) full-dim tuple distribution → (+ noise) noisy distribution → (sample) synthetic database D*


Example: a database with 10M tuples, 10 attributes (dimensions), and 20 values per attribute:

Scalability: the full distribution Pr[t] has 20^10 ≈ 10^13 cells
◦ most of them have non-zero counts after noise injection
◦ privacy is expensive (computation, storage)

Signal-to-noise: avg. information in each cell is 10^7 / 10^13 = 10^-6; avg. noise is 10 (for ε = 0.1)

Previous solutions suffer from either the scalability or the signal-to-noise problem.
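The numbers above can be checked with one-line arithmetic (assuming, as the slide does, Laplace noise of scale 1/ε per cell):

```python
cells = 20 ** 10              # full-dimensional domain: 20 values ^ 10 attributes
tuples = 10_000_000           # 10M tuples
epsilon = 0.1

avg_signal = tuples / cells   # average count per cell: ~1e-6
avg_noise = 1 / epsilon       # Laplace noise scale per cell: 10
```

So the per-cell noise exceeds the per-cell signal by about seven orders of magnitude.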

PrivBayes replaces the full-dimensional path with a low-dimensional one:

full-dim path: sensitive database D → (convert, + noise) full-dim tuple distribution → noisy distribution → (sample) synthetic database D*

low-dim path: sensitive database D → (approximate) a set of low-dim distributions → (+ noise) noisy low-dim distributions → (convert) noisy distribution → (sample) synthetic database D*

The advantages of using low-dimensional distributions
◦ easy to compute
◦ small domain -> high signal density -> robust against noise

But how can we find a set of low-dim distributions that provides a good approximation to the full distribution?


A 5-dimensional database over age, workclass, income, education, title.
[Figure: a 1-degree Bayesian network over the five attributes; each non-root attribute X carries a conditional distribution Pr[X | parent(X)].]

A 5-dimensional database over age, workclass, income, education, title.

A 1-degree network factorizes the tuple distribution into terms with one parent each:
Pr[t] ≈ ∏ᵢ Pr[Xᵢ | parent(Xᵢ)]

A 2-degree network conditions each attribute on up to two parents:
Pr[t] ≈ ∏ᵢ Pr[Xᵢ | parents(Xᵢ)], with |parents(Xᵢ)| ≤ 2

The quality of the Bayesian network decides the quality of the approximation.
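To see why such a factorization defines a full joint distribution, here is a toy example with three hypothetical binary attributes and an assumed chain X1 → X2 → X3 (the structure and probabilities are made up for illustration):

```python
# Hypothetical 1-degree network X1 -> X2 -> X3 over binary attributes.
p_x1 = {0: 0.4, 1: 0.6}
p_x2_given_x1 = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}
p_x3_given_x2 = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.5, 1: 0.5}}

def prob(t):
    """Pr[t] via the chain factorization Pr[X1] * Pr[X2|X1] * Pr[X3|X2]."""
    x1, x2, x3 = t
    return p_x1[x1] * p_x2_given_x1[x1][x2] * p_x3_given_x2[x2][x3]

# The factorization sums to 1 over all tuples, so it is a valid joint.
total = sum(prob((a, b, c)) for a in (0, 1) for b in (0, 1) for c in (0, 1))
```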


STEP 1: Choose a suitable Bayesian network N
◦ must be done in a differentially private way

STEP 2: Compute the conditional distributions implied by N
◦ straightforward to do under differential privacy
◦ inject noise via the Laplace mechanism

STEP 3: Generate synthetic data by sampling from N
◦ post-processing: no privacy issues
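Steps 2 and 3 can be sketched as follows (an illustrative sketch only: the actual PrivBayes algorithm splits the privacy budget across steps and post-processes the noisy marginals more carefully):

```python
import numpy as np

rng = np.random.default_rng(7)

def noisy_distribution(counts, epsilon):
    """STEP 2 sketch: add Laplace noise to a low-dimensional marginal,
    then clip to nonnegative and renormalize into a distribution."""
    noisy = counts + rng.laplace(0.0, 1.0 / epsilon, size=counts.shape)
    noisy = np.clip(noisy, 0.0, None)
    return noisy / noisy.sum()

def sample_chain(p_a, p_b_given_a, n):
    """STEP 3 sketch: sample n tuples (A, B) from a tiny 1-degree
    network A -> B, where p_b_given_a[a] is the row Pr[B | A=a]."""
    a = rng.choice(len(p_a), size=n, p=p_a)
    b = np.array([rng.choice(p_b_given_a.shape[1], p=p_b_given_a[x]) for x in a])
    return np.stack([a, b], axis=1)
```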

Finding the optimal 1-degree Bayesian network was solved in [Chow-Liu'68]. It is a DAG of maximum in-degree 1 that maximizes the sum of the mutual information I of its edges:

∑_{(X,Y): edge} I(X, Y),   where I(X, Y) = ∑_{x ∈ dom(X)} ∑_{y ∈ dom(Y)} Pr[x, y] log ( Pr[x, y] / (Pr[x] ⋅ Pr[y]) ).

Equivalently: find the maximum spanning tree, where the weight of edge (X, Y) is the mutual information I(X, Y).
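The edge weights are easy to compute from a joint probability table; a small helper (assuming the joint is given as a matrix):

```python
import numpy as np

def mutual_information(joint):
    """I(X, Y) = sum_xy Pr[x,y] * log( Pr[x,y] / (Pr[x] Pr[y]) ),
    with the convention 0 * log(0) = 0."""
    joint = np.asarray(joint, dtype=float)
    px = joint.sum(axis=1, keepdims=True)   # marginal Pr[x]
    py = joint.sum(axis=0, keepdims=True)   # marginal Pr[y]
    mask = joint > 0
    return float((joint[mask] * np.log(joint[mask] / (px @ py)[mask])).sum())
```

Independent attributes give I = 0; perfectly correlated binary attributes give I = log 2.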

Build a 1-degree BN for a database with four binary attributes A, B, C, D:

Name    A  B  C  D
Alan    0  0  0  0
Bob     0  0  0  0
Cykie   1  1  1  0
David   0  0  0  0
Eric    1  1  0  0
Frank   1  1  0  0
George  0  0  0  0
Helen   1  1  1  0
Ivan    0  0  0  0
Jack    1  1  0  0

Start from a random attribute, say A.
[Figure: nodes A, B, C, D, with no edges yet.]

Select the next tree edge by its mutual information.
[Figure: the database alongside the partial network. Candidate edges: A→B, A→C, A→D, each weighted by the mutual information computed from the data; I(A, B) is largest, since columns A and B are identical.]

Select the next tree edge by its mutual information.
[Figure: after adding A→B, candidates run from attributes already in the tree ({A, B}) to those outside it (e.g. A→C, B→C, A→D); each carries its mutual information I as weight. The heaviest candidate is added, and the process repeats until A, B, C, D are all connected. DONE!]

It is NP-hard to learn the optimal k-degree Bayesian network when k > 1 [JMLR'04].

Most approximation algorithms are too complicated to be converted into private algorithms.

In our paper, we find a way to extend the Chow-Liu (1-degree) solution to higher-degree cases.

In this talk, we focus on the 1-degree case for simplicity.

Do it under Differential Privacy!
(Non-private) select the edge with the maximum I
(Private) I is data-sensitive → the best edge is also data-sensitive

Solution: randomized edge selection!

Define a score function F: Databases × Edges → ℝ, where F(D, e) measures how good edge e is as the result of selection, given database D.

Return edge e with probability
Pr[e] ∝ exp( (ε / (2 ΔF)) ⋅ F(D, e) )   (info F in the exponent, noise scale ΔF in the denominator)

where ΔF = max over edges e and neighboring D, D′ of | F(D, e) − F(D′, e) |.

Do it under Differential Privacy! Problem solved? NO.

Select edges with the exponential mechanism:
◦ define F(edge) = I(edge)
◦ we prove ΔI = Θ(log n / n), where n = |D|. (Lemma 1)

The sensitivity (noise scale) log n / n is too large for the info I:
Pr[e] ∝ exp( (ε / (2 log n / n)) ⋅ I(e) )  — the signal is drowned by the noise.
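The selection rule above is the standard exponential mechanism; a minimal sketch, with the candidate scores and sensitivity ΔF supplied by the caller:

```python
import numpy as np

rng = np.random.default_rng(0)

def exponential_mechanism(scores, epsilon, sensitivity):
    """Pick index e with Pr[e] proportional to exp(eps * F(e) / (2 * dF))."""
    scores = np.asarray(scores, dtype=float)
    logits = epsilon * scores / (2.0 * sensitivity)
    logits -= logits.max()                 # subtract max for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return int(rng.choice(len(scores), p=probs))
```

With a large sensitivity such as Θ(log n / n) relative to the score scale, the exponent shrinks and the selection drifts toward uniform; a smaller ΔF sharpens it.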

F and I have a strong positive correlation.

Function | Range (scale of info) | Sensitivity (scale of noise)
I        | Θ(1)                  | Θ(log n / n)
F        | Θ(1)                  | Θ(1/n)

IDEA: define the score F to agree with I at I's maximum values, and interpolate linearly in between ("how far is Pr[X, Y] from an optimum?"):

F(X, Y) = − (1/2) ⋅ min_Π ‖ Pr[X, Y] − Π ‖₁

where Π ranges over the "optimal" distributions over X, Y that maximize I(X, Y).

Range of F: Θ(1); Sensitivity of F: Θ(1/n).
[Worked example: for a concrete joint Pr[X, Y], the L1 distances to the optimal distributions Π are 0.4 and 1.6; taking the minimum gives F = −(1/2) ⋅ 0.4 = −0.2.]


[Figure: scatter plot of F vs. I over random distributions; correlation coefficient 0.9472.]
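For two binary attributes, the lemma in the backup slides says the information-maximizing joints are the two "permutation" distributions, which gives a concrete (binary-only, illustrative) implementation of F:

```python
import numpy as np

def score_F(joint):
    """F(X, Y) = -1/2 * min over optimal joints of || Pr[X,Y] - Pi ||_1.

    For two binary attributes, the joints maximizing I(X, Y) are
    [[0.5, 0], [0, 0.5]] and [[0, 0.5], [0.5, 0]]: uniform on X, with
    each y tied to at most one x.
    """
    joint = np.asarray(joint, dtype=float)
    optima = [np.array([[0.5, 0.0], [0.0, 0.5]]),
              np.array([[0.0, 0.5], [0.5, 0.0]])]
    return -0.5 * min(np.abs(joint - p).sum() for p in optima)
```

F is 0 at an optimum and strictly negative elsewhere, and moving one tuple between cells changes F by O(1/n), matching the Θ(1/n) sensitivity.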

We use four datasets in our experiments
◦ Adult, NLTCS, TPC-E, BR2000

Adult dataset
◦ census data of 45,222 individuals
◦ 15 attributes: age, workclass, education, marital status, etc.
◦ tuple domain size (full-dimensional): about 2^52
[Figures: query accuracy for all 2-way marginals and all 3-way marginals, and for building 4 classifiers on Adult (label Y = gender and Y = education).]

Differential privacy can be applied effectively for data release

Key ideas of the solution:
◦ Bayesian networks for dimension reduction
◦ a carefully designed linear quality (score) function for the exponential mechanism

Many open problems remain:
◦ extend to other forms of data: graph data, mobility data
◦ obtain alternate (workable) privacy definitions
Thanks!

Privacy, accuracy, and consistency too: a holistic solution to
contingency table release [PODS’07]
◦ incurs an exponential running time
◦ only optimized for low-dimensional marginals

Differentially private publication of sparse data [ICDT’12]
◦ achieves scalability, but no help for signal-to-noise problem

Differentially private spatial decompositions [ICDE'12]
◦ coarsens the histogram H to control the number of cells
◦ has some limits, e.g., to range queries over ordinal domains

Assume that |dom(X)| ≤ |dom(Y)|. A distribution Pr[X, Y] maximizes the mutual information between X and Y if and only if
◦ Pr[x] = 1/|dom(X)|, for any x ∈ dom(X);
◦ for each y ∈ dom(Y), there is at most one x ∈ dom(X) with Pr[x, y] > 0.

Analogy: two score functions of a real x ⟹ log(1 + x) and x

neighboring databases ⟹ x and x + Δ

sensitivity (noise) ⟹ max of the derivative: ∞ and 1
Interactive setting: the user sends a query Q with privacy budget ε to a differentially private algorithm over the database, and receives a noisy answer A.
1. the risk of a privacy breach accumulates after answering multiple queries
2. it requires a specific DP algorithm for every particular query

Private data release: spend the privacy budget ε once to publish synthetic data; the user then runs queries against the synthetic data directly.
Reusability: the sensitive data is accessed only once
Generality: most queries are supported
