### Slides - Jun Zhang

Jun Zhang, Graham Cormode, Cecilia M. Procopiuc, Divesh Srivastava, Xiaokui Xiao

The Problem: Private Data Release
◦ Differential Privacy
◦ Challenges

The Algorithm: PrivBayes
◦ Bayesian Network
◦ Details of PrivBayes

Function F: Linear vs. Logarithmic

Experiments

Motivation: a company or institute owns a sensitive database. The goal is to release a public synthetic database D* that has similar properties to the sensitive data and supports accurate inference.

How can we design such a private data release algorithm?

#### Differential Privacy

Definition of ε-Differential Privacy
◦ A randomized data release algorithm A satisfies ε-differential privacy, if for any two neighboring datasets D, D′ and for any possible synthetic data D*,

Pr[ A(D) = D* ] ≤ exp(ε) ⋅ Pr[ A(D′) = D* ]
Two neighboring databases D and D′, differing only in Frank's record:

D:

| Name | Has cancer? |
|---|---|
| Alice | Yes |
| Bob | No |
| Chris | Yes |
| Denise | Yes |
| Eric | No |
| Frank | Yes |

D′:

| Name | Has cancer? |
|---|---|
| Alice | Yes |
| Bob | No |
| Chris | Yes |
| Denise | Yes |
| Eric | No |
| Frank | No |

A general approach to achieving differential privacy is to inject Laplace noise into the output, in order to cover the impact of any individual!

More details in the Preliminaries part of the paper.

Goal: design a data release algorithm with a differential privacy guarantee.
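The Laplace approach above can be sketched in a few lines. This is a minimal illustration for a single counting query (sensitivity 1), not the paper's full release algorithm; the function name is ours:

```python
import random

def laplace_mechanism(true_count: int, epsilon: float) -> float:
    """Release a count with epsilon-differential privacy via Laplace noise."""
    # A count changes by at most 1 when one record changes,
    # so the sensitivity is 1 and the Laplace noise scale is 1/epsilon.
    scale = 1.0 / epsilon
    # The difference of two iid exponentials with mean `scale`
    # is a Laplace(0, scale) sample.
    noise = random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)
    return true_count + noise

# e.g. privately releasing the number of "Yes" rows in the cancer column
noisy = laplace_mechanism(4, epsilon=0.1)
```

Note how a smaller ε (stronger privacy) means a larger noise scale, which is exactly the signal-to-noise tension discussed in the Challenges section.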

#### Challenges

To build synthetic data, we need to understand the tuple distribution Pr[D*] of the sensitive data:

sensitive database → (convert) full-dim tuple distribution → (+ noise) noisy distribution → (sample) synthetic database D*


Example: database has 10M tuples, 10 attributes (dimensions), and 20 values per attribute:

Scalability: the full distribution Pr[D*] has 20^10 ≈ 10^13 cells
◦ most of them have non-zero counts after noise injection
◦ privacy is expensive (computation, storage)

Signal-to-noise: avg. information in each cell is 10^7 / 10^13 = 10^-6; avg. noise is 10 (for ε = 0.1)

Previous solutions suffer from either the scalability or the signal-to-noise problem.
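The back-of-envelope arithmetic behind this example can be checked directly:

```python
# 10M tuples, 10 attributes, 20 values per attribute (the slide's example).
tuples = 10 ** 7
cells = 20 ** 10           # full-dimensional domain size, about 1.02e13
avg_signal = tuples / cells  # average count per cell, about 1e-6
avg_noise = 1 / 0.1          # Laplace scale 1/epsilon for epsilon = 0.1
# The noise (10) dwarfs the per-cell signal (~1e-6) by seven orders of magnitude.
```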

#### The Algorithm: PrivBayes
PrivBayes replaces the full-dimensional pipeline with a low-dimensional one:

sensitive database → (approximate) a set of low-dim distributions → (+ noise) noisy low-dim distributions → (sample) synthetic database D*

The advantages of using low-dimensional distributions
◦ easy to compute
◦ small domain -> high signal density -> robust against noise

But how can we find a set of low-dim distributions that provides a good approximation to the full distribution?

#### Bayesian Network

A 5-dimensional database with attributes age, workclass, income, education, title. A Bayesian network over these attributes assigns each attribute a small set of parents and the corresponding conditional distribution.

A degree-1 network conditions each attribute on one parent, e.g.:

Pr[D*] ≈ Pr[age] ⋅ Pr[workclass | age] ⋅ Pr[education | age] ⋅ Pr[title | education] ⋅ Pr[income | workclass]

A degree-2 network conditions each attribute on up to two parents, e.g.:

Pr[D*] ≈ Pr[age] ⋅ Pr[workclass | age] ⋅ Pr[education | age, workclass] ⋅ Pr[title | age, education] ⋅ Pr[income | workclass, title]

The quality of the Bayesian network decides the quality of the approximation.

#### Details of PrivBayes

STEP 1: Choose a suitable Bayesian network N
◦ must be done in a differentially private way

STEP 2: Compute the conditional distributions implied by N
◦ straightforward to do under differential privacy
◦ inject noise via the Laplace mechanism

STEP 3: Generate synthetic data by sampling from N
◦ post-processing: no privacy issues
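STEP 3, ancestral sampling from a degree-1 network, can be sketched as follows. The network structure and the (already noised) conditional tables below are illustrative placeholders, not the paper's learned values:

```python
import random

# (attribute, parent) pairs in topological order; None = no parent.
network = [("A", None), ("B", "A"), ("C", "A"), ("D", "B")]

# Conditional distributions Pr[attr | parent value]; illustrative numbers.
cond = {
    ("A", None): {(): {0: 0.5, 1: 0.5}},
    ("B", "A"): {(0,): {0: 0.9, 1: 0.1}, (1,): {0: 0.2, 1: 0.8}},
    ("C", "A"): {(0,): {0: 0.7, 1: 0.3}, (1,): {0: 0.4, 1: 0.6}},
    ("D", "B"): {(0,): {0: 0.95, 1: 0.05}, (1,): {0: 0.5, 1: 0.5}},
}

def sample_row():
    """Sample one synthetic tuple attribute-by-attribute along the network."""
    row = {}
    for attr, parent in network:
        key = () if parent is None else (row[parent],)
        dist = cond[(attr, parent)][key]
        vals, probs = zip(*dist.items())
        row[attr] = random.choices(vals, weights=probs)[0]
    return row

synthetic = [sample_row() for _ in range(5)]
```

Because the sampling only touches the noisy conditionals, it is pure post-processing and consumes no extra privacy budget.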

Finding the optimal 1-degree Bayesian network was solved in [Chow-Liu'68]. It is a DAG of maximum in-degree 1, and it maximizes the sum of the mutual information I over its edges:

Σ_{(X,Y): edge} I(X, Y),   where I(X, Y) = Σ_{x ∈ dom(X)} Σ_{y ∈ dom(Y)} Pr[x, y] ⋅ log ( Pr[x, y] / (Pr[x] ⋅ Pr[y]) ).

Finding the optimal 1-degree network ⇔ finding the maximum spanning tree, where the weight of edge (X, Y) is the mutual information I(X, Y).
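The edge weights can be estimated from data. A minimal sketch of empirical mutual information over two attribute columns (natural log, so values are in nats; the function name is ours):

```python
from collections import Counter
from math import log

def mutual_information(xs, ys):
    """I(X;Y) of the empirical joint distribution of paired samples, in nats."""
    n = len(xs)
    px, py = Counter(xs), Counter(ys)          # marginal counts
    pxy = Counter(zip(xs, ys))                 # joint counts
    return sum(
        (c / n) * log((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )
```

Perfectly correlated columns give I = log 2 ≈ 0.693 nats for binary attributes; independent columns give I = 0.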

Build a 1-degree BN for a database with four binary attributes A, B, C, D:

| Name | A | B | C | D |
|---|---|---|---|---|
| Alan | 0 | 0 | 0 | 0 |
| Bob | 0 | 0 | 0 | 0 |
| Cykie | 1 | 1 | 1 | 0 |
| David | 0 | 0 | 0 | 0 |
| Eric | 1 | 1 | 0 | 0 |
| Frank | 1 | 1 | 0 | 0 |
| George | 0 | 0 | 0 | 0 |
| Helen | 1 | 1 | 1 | 0 |
| Ivan | 0 | 0 | 0 | 0 |
| Jack | 1 | 1 | 0 | 0 |

Start from a random attribute, say A.

Select the next tree edge by its mutual information: among the candidate edges from A to the remaining attributes (A→B, A→C, A→D), add the one with maximum I.

Repeat: at each step the candidates are all edges from attributes already in the tree to attributes outside it; again add the maximum-I edge.

When all attributes are connected: DONE!
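The greedy (Prim-style) tree construction just described can be sketched as below. The mutual-information weights are illustrative, since the numeric values on the original slides are not recoverable:

```python
def chow_liu_tree(attrs, mi):
    """Maximum spanning tree over attributes, grown Prim-style.
    `mi` maps frozenset({X, Y}) -> mutual information of the edge."""
    in_tree = {attrs[0]}              # start from an arbitrary attribute
    edges = []
    while len(in_tree) < len(attrs):
        # Candidates: edges from attributes in the tree to those outside it.
        best = max(
            ((u, v) for u in in_tree for v in attrs if v not in in_tree),
            key=lambda e: mi[frozenset(e)],
        )
        edges.append(best)
        in_tree.add(best[1])
    return edges

attrs = ["A", "B", "C", "D"]
mi = {  # illustrative weights
    frozenset({"A", "B"}): 0.5, frozenset({"A", "C"}): 0.2,
    frozenset({"A", "D"}): 0.1, frozenset({"B", "C"}): 0.3,
    frozenset({"B", "D"}): 0.4, frozenset({"C", "D"}): 0.05,
}
tree = chow_liu_tree(attrs, mi)
```

Each selected edge (parent, child) then becomes an in-degree-1 conditional Pr[child | parent] in the network.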

It is NP-hard to train the optimal k-degree Bayesian network when k > 1 [JMLR'04].

Most approximation algorithms are too complicated to be converted into private algorithms.

In our paper, we find a way to extend the Chow-Liu solution (degree 1) to higher-degree cases.

In this talk, we focus on the 1-degree case for simplicity.

Do it under differential privacy!
◦ (Non-private) select the edge with maximum I
◦ (Private) I is data-sensitive -> the best edge is also data-sensitive

Solution: randomized edge selection!
◦ Define a score function F: (databases × edges) → ℝ, measuring how good edge e is as the result of the selection, given database D
◦ Return e with probability Pr[e] ∝ exp( ε ⋅ F(D, e) / (2 ⋅ ΔF) )   (info in the numerator, noise scale in the denominator)

where ΔF = max over edges e and neighboring databases D, D′ of |F(D, e) − F(D′, e)|.
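The randomized selection rule above is the exponential mechanism; a minimal sketch (the candidate scores below are hypothetical):

```python
import math
import random

def exponential_mechanism(candidates, score, epsilon, sensitivity):
    """Pick a candidate with Pr[e] proportional to exp(eps*score(e)/(2*delta))."""
    weights = [math.exp(epsilon * score(e) / (2 * sensitivity)) for e in candidates]
    return random.choices(candidates, weights=weights)[0]

# Hypothetical edge scores: "b" scores much higher, so with epsilon large
# relative to the sensitivity it is selected with overwhelming probability.
scores = {"a": 0.0, "b": 1.0}
pick = exponential_mechanism(["a", "b"], scores.get, epsilon=10.0, sensitivity=0.1)
```

The key quantity is the ratio of score scale to sensitivity: if the sensitivity is large relative to the scores (as with I on the next slide), all candidates get nearly equal weight and the selection is close to uniform noise.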

Do it under differential privacy! Problem solved? NO.

Select edges with the exponential mechanism:
◦ define F(edge) = I(edge)
◦ we prove ΔI = Θ(log n / n), where n = |D| (Lemma 1)

The sensitivity (noise scale) log n / n is too large for the scale of the info:

Pr[e] ∝ exp( ε ⋅ I(e) / (2 ⋅ log n / n) )   (info vs. noise)

#### Function F: Linear vs. Logarithmic
F and I have a strong positive correlation.

| Function | Range (scale of info) | Sensitivity (scale of noise) |
|---|---|---|
| I | Θ(1) | Θ(log n / n) |
| F | Θ(1) | Θ(1/n) |
IDEA: define the score F to agree with I at its maximum values and interpolate linearly in-between:

F(X, Y) = −(1/2) ⋅ min_Π ‖ Pr[X, Y] − Π ‖₁

where Π ranges over the "optimal" distributions over X, Y that maximize I(X, Y); the L1 term measures how far Pr[X, Y] is from the nearest maximizer.

◦ Range of F: Θ(1)
◦ Sensitivity of F: Θ(1/n)
Worked example (2×2 joint distribution): the L1 distances from Pr[X, Y] to the two I-maximizing distributions are 0.4 and 1.6; taking the minimum, F = −(1/2) ⋅ 0.4 = −0.2.
Empirically, F and I evaluated on random distributions have correlation coefficient 0.9472.

#### Experiments

We use four datasets in our experiments, e.g.:

◦ census data of 45,222 individuals
◦ 15 attributes: age, workclass, education, marital status, etc.
◦ tuple domain size (full-dimensional): about 2^52

Queries: all 2-way marginals; all 3-way marginals; building 4 classifiers.

Differential privacy can be applied effectively for data release.

Key ideas of the solution:
◦ Bayesian networks for dimension reduction
◦ a carefully designed linear quality function for the exponential mechanism

Many open problems remain:
◦ extend to other forms of data: graph data, mobility data
◦ obtain alternative (workable) privacy definitions

Thanks!

Privacy, accuracy, and consistency too: a holistic solution to
contingency table release [PODS’07]
◦ incurs an exponential running time
◦ only optimized for low-dimensional marginals

Differentially private publication of sparse data [ICDT’12]
◦ achieves scalability, but no help for signal-to-noise problem

Differentially private spatial decompositions [ICDE’12]
◦ coarsens the histogram H to control the number of cells
◦ has some limitations, e.g., range queries, ordinal domains

Assume that |dom(X)| ≤ |dom(Y)|. A distribution Pr[X, Y] maximizes the mutual information between X and Y if and only if:
◦ Pr[x] = 1/|dom(X)|, for any x ∈ dom(X);
◦ for each y ∈ dom(Y), there is at most one x ∈ dom(X) with Pr[x, y] > 0.
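The lemma can be sanity-checked numerically: a distribution satisfying both conditions attains the maximum I = log |dom(X)|. A sketch (natural log, so values are in nats; the function name and the example distribution are ours):

```python
from math import log

def mi(joint):
    """Mutual information of a joint distribution given as {(x, y): prob}."""
    px, py = {}, {}
    for (x, y), v in joint.items():
        px[x] = px.get(x, 0.0) + v
        py[y] = py.get(y, 0.0) + v
    return sum(v * log(v / (px[x] * py[y]))
               for (x, y), v in joint.items() if v > 0)

# |dom(X)| = 2, |dom(Y)| = 3. Marginal on X is uniform (0.5, 0.5) and each
# y co-occurs with at most one x, so the lemma says I is maximal: log 2.
maximizer = {(0, 0): 0.5, (1, 1): 0.3, (1, 2): 0.2}
```

The uniform independent distribution, by contrast, violates the second condition and gives I = 0.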

◦ two score functions for a real argument ⟹ log(1 + ·) and linear
◦ neighboring databases ⟹ the argument shifts by Δ
◦ sensitivity (noise) ⟹ max of the derivative: ∞ and 1
A differentially private algorithm answers a query over the database under a privacy budget.

1. risk of privacy breach accumulates after