### Lorentz Center

### Preserving Validity in Adaptive Data Analysis

Moritz Hardt

Joint work with Cynthia Dwork, Vitaly Feldman, Toni Pitassi, Omer Reingold, and Aaron Roth
### Statistical Estimation

- Data domain X, class labels Y
- Unknown distribution D over X × Y
- Data set S of size n sampled i.i.d. from D
- Problem: given a function q : X × Y ⟶ [0,1], estimate the expected value of q over D (a "statistic")

Notation:

- D[q] = expected value of q over D
- S[q] = average of q over S
### Example: Empirical Loss Estimation

We trained a classifier f : X ⟶ Y and want to know how good it is. We can estimate the 0/1-loss of the classifier using the function q(x,y) = 1{ f(x) ≠ y }:

- S[q] = empirical loss with respect to the sample S
- D[q] = true loss with respect to the unknown D
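As a concrete sketch (the threshold classifier and the 10% label-noise distribution below are invented for illustration, not from the talk), S[q] can be computed and compared against the true loss D[q] = 0.1:

```python
import random

random.seed(0)

# Toy distribution D over X × Y: x uniform in [0, 1], true label 1{x > 0.5},
# flipped with probability 0.1 (so the classifier below has true loss 0.1).
def sample_from_D(n):
    data = []
    for _ in range(n):
        x = random.random()
        y = int(x > 0.5)
        if random.random() < 0.1:
            y = 1 - y
        data.append((x, y))
    return data

def f(x):                      # a "trained" classifier f : X -> Y
    return int(x > 0.5)

def q(x, y):                   # q(x, y) = 1{ f(x) != y }
    return int(f(x) != y)

def sample_estimate(S):        # S[q]: the average of q over the sample
    return sum(q(x, y) for x, y in S) / len(S)

S = sample_from_D(10_000)
print(sample_estimate(S))      # close to D[q] = 0.1 for n = 10,000
```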
### Example 2: Statistical Query Model

A function q specifies a statistical query, and an estimate a of D[q] is an answer to the query. Almost all (reasonable) learning algorithms can be implemented using sufficiently accurate answers to statistical queries.
### Statistical Validity: Sample versus Distribution

Definition. A sample estimate a of D[q] is statistically valid if |a − D[q]| < o(1).

In particular, S[q] is statistically valid if |S[q] − D[q]| < o(1).
### Estimation Problem

Definition: Given a data set S of size n drawn from D, an algorithm estimates the statistic D[q] if it returns a statistically valid estimate of D[q].

Problem: Given a data set of size n, how many statistics can we estimate?

[Figure: the data analyst sends queries q1, q2, …, qk to an algorithm holding a sample of size n, which returns answers a1, a2, …, ak.]
Theorem [folklore]. There is an algorithm that estimates exp(n) nonadaptively chosen statistics.

Proof: Fix functions q1, …, qk. Sample a data set S of size n. By the Hoeffding bound together with a union bound, S[qi] has error o(1) for all 1 ≤ i ≤ k.
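The Hoeffding-plus-union-bound argument can be checked numerically. A sketch, assuming a uniform distribution on [0, 1] and threshold queries q_i(x) = 1{x ≤ t_i} as stand-ins for arbitrary fixed functions:

```python
import random

random.seed(1)

n, k = 5_000, 500

# Fixed (nonadaptive) queries q_i(x) = 1{x <= t_i}; under the uniform
# distribution on [0, 1] the true value is D[q_i] = t_i.
thresholds = [i / k for i in range(1, k + 1)]

# One sample S of size n, reused for all k queries.
S = [random.random() for _ in range(n)]

def empirical(t):              # S[q_i]: fraction of the sample below t
    return sum(x <= t for x in S) / n

# Hoeffding + union bound: max error is O(sqrt(log(k) / n)) w.h.p.
max_error = max(abs(empirical(t) - t) for t in thresholds)
print(max_error)
```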
[Figure: the adaptive interaction. The analyst sends q1 and receives a1, chooses q2 based on a1 and receives a2, and so on through qk, ak; the algorithm holds a sample of size n throughout.]
### Naive bound

Theorem [folklore]. There is an algorithm that can estimate nearly n adaptively chosen statistics.

Proof: Partition the data into roughly n chunks and use a fresh chunk of small size to estimate each statistic. Each query is answered on data it has never influenced, so the nonadaptive analysis applies to every answer.
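A minimal sketch of this sample-splitting mechanism (the uniform toy distribution and the name `make_splitting_mechanism` are my own, not from the talk):

```python
import random

random.seed(2)

# Partition S into k disjoint chunks; the i-th query, however it was
# chosen, is answered on fresh chunk i only.
def make_splitting_mechanism(S, k):
    size = len(S) // k
    chunks = [S[i * size:(i + 1) * size] for i in range(k)]
    state = {"i": 0}
    def answer(q):
        chunk = chunks[state["i"]]
        state["i"] += 1
        return sum(q(x) for x in chunk) / len(chunk)
    return answer

S = [random.random() for _ in range(10_000)]
answer = make_splitting_mechanism(S, k=10)

# Each answer uses n/k untouched points, so its error is O(sqrt(k / n))
# even when a query depends on all previous answers.
a1 = answer(lambda x: x)               # D[q] = 0.5 (mean of uniform)
a2 = answer(lambda x: float(x <= a1))  # adaptively chosen follow-up
print(a1, a2)
```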
### Our main result

Theorem. There is an algorithm that estimates exp(n) adaptively chosen statistics.

Caveat: running time poly(n, |X|), where |X| is the domain size.
### A computationally efficient estimator

Theorem. There is a computationally efficient algorithm that can estimate nearly n² adaptively chosen statistics.
### Can we do better?

No!

Theorem [H-Ullman, Ullman-Steinke]. There is no computationally efficient algorithm that can estimate n^{2+o(1)} adaptively chosen statistics.
[Figure: power of the analyst (number of statistics) plotted against power of the algorithm. For poly-time algorithms, n² statistics are possible and more is impossible [HU, US]; for exp-time algorithms, exp(n) statistics are possible [folklore].]
### False Discovery
“Trouble at the Lab” – The Economist
### Preventing false discovery

Powerful results such as Benjamini-Hochberg (20,000+ citations) control the False Discovery Rate, but the theory focuses on non-adaptive hypothesis testing.

Adaptive analysis goes by many names: data dredging, data snooping, fishing, p-hacking, post-hoc analysis, the garden of the forking paths.
Some caution strongly against it: "pre-registration" requires specifying the entire experimental setup ahead of time (Humphreys, Sanchez, Windt 2013; Monogan 2013).

> The most valuable statistical analyses often arise only after an iterative process involving the data.
>
> — Gelman, Loken (2013)
Our results: rich adaptive data analysis is possible with provable confidence bounds.
### Differential Privacy

A notion designed for privacy protection in statistical data analysis. Here, we use differential privacy as a tool to achieve statistical validity.
### Differential Privacy (Dwork-McSherry-Nissim-Smith 06)

Informal Definition. A randomized algorithm A is differentially private if for any two data sets S, S' that differ in at most one element, the random variables A(S) and A(S') are "indistinguishable".

[Figure: the output densities of A(S) and A(S') plotted over the output space; the ratio of the two densities is bounded by 1 ± o(1) everywhere.]
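To make the definition concrete, here is a minimal sketch of the Laplace mechanism, one standard way to achieve differential privacy (the query, distribution, and parameter choices are illustrative assumptions, not from the talk):

```python
import math
import random

random.seed(3)

# Sample Laplace noise via inverse CDF; scale b gives density e^(-|z|/b)/(2b).
def laplace_noise(b):
    u = random.random() - 0.5
    return -b * (1 if u >= 0 else -1) * math.log(1 - 2 * abs(u))

# Laplace mechanism: since q maps into [0, 1], changing one of the n points
# changes S[q] by at most 1/n, so noise of scale 1/(eps * n) makes the output
# distributions on neighboring data sets S, S' have density ratio <= e^eps.
def private_estimate(S, q, eps):
    s_q = sum(q(x) for x in S) / len(S)
    return s_q + laplace_noise(1.0 / (eps * len(S)))

S = [random.random() for _ in range(10_000)]
est = private_estimate(S, lambda x: float(x <= 0.5), eps=0.5)
print(est)  # close to S[q]: the noise scale is only 1/(eps * n) = 0.0002
```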
### Why Differential Privacy?

DP is a stability guarantee: changing one sample point implies a small change in the output distribution.

General theme: stability implies generalization.

- Known in the non-adaptive setting (Bousquet-Elisseeff 2002, …, SSSS10)
- New technical challenges arise due to adaptivity
### Why Differential Privacy?

DP comes with strong composition guarantees: the composition of multiple differentially private algorithms is still differentially private. Other stability notions don't compose.
### Transfer Theorem

Differential privacy together with accuracy on the sample implies accuracy on the distribution.

We can instantiate the transfer theorem with existing DP algorithms. Example: Private Multiplicative Weights [H-Rothblum] gives the main theorem.
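Private Multiplicative Weights is too involved for a slide, but a quantitatively weaker instantiation is easy to sketch: add Laplace noise to each empirical answer and split the privacy budget over the k queries via basic composition. All names and parameters below are my own illustrative choices:

```python
import math
import random

random.seed(4)

def laplace_noise(b):          # Laplace sample of scale b via inverse CDF
    u = random.random() - 0.5
    return -b * (1 if u >= 0 else -1) * math.log(1 - 2 * abs(u))

# Answer up to k adaptively chosen statistical queries from one sample by
# perturbing each S[q] with Laplace noise. Basic composition makes the whole
# interaction eps-DP; the transfer theorem then bounds |answer - D[q]|.
def make_private_answerer(S, k, eps):
    scale = k / (eps * len(S))   # per-query budget eps/k, sensitivity 1/n
    def answer(q):
        return sum(q(x) for x in S) / len(S) + laplace_noise(scale)
    return answer

S = [random.random() for _ in range(100_000)]
answer = make_private_answerer(S, k=10, eps=1.0)
a1 = answer(lambda x: float(x <= 0.5))   # D[q] = 0.5
a2 = answer(lambda x: float(x <= a1))    # adaptive follow-up query
print(a1, a2)
```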
### Questions?