### Slides (short)

```*
Xin Luna Dong (AT&T Labs → Google Inc.)
Barna Saha, Divesh Srivastava (AT&T Labs-Research)
VLDB’2013
*
Cost
*Lots of money
*Lots of machines
*Lots of people
*
[Figure: Gain curve, CS books from AbeBooks.com]
In total 894 sources, 1265 CS books
1096 books from the largest source
1213 books from the 2 largest sources
1250 books from the 10 largest sources
1260 books from the first 35 sources
All 1265 books from the first 537 sources
*
[Figure: Gain curve, CS books from AbeBooks.com]
All 100 books (gold standard) from the first 548 sources
78 books w. correct authors for Vote
80 books w. correct authors for Accu
93 > 80 books w. correct authors after 583 sources (Vote)
90 > 80 books w. correct authors after 579 sources (Accu)
*
*Questions
*Is it best to integrate all data?
*How to spend computing resources wisely?
*How to wisely select sources before real integration to balance the gain and the cost?
*Prelude for data integration, outside the integration steps themselves (schema mapping, entity resolution, data fusion)
*
17 books w. correct authors from 300 sources (budget)
14 books (17.6% fewer) w. correct authors from the first 200 sources (33% fewer resources)
CS books from AbeBooks.com
*
65 books w. correct authors (quality requirement) from the first 520 sources
81 books (25% more) w. correct authors from 526 sources (1% more)
CS books from AbeBooks.com
[Figure: two panels over #(Resource Unit), values in \$]
Panel: Gain and Cost
Panel: Marginal Gain and Marginal Cost
The law of Diminishing Returns
Largest profit: Marginal gain = Marginal cost
*
Challenge 1. The Law of Diminishing Returns does not necessarily hold, so there can be multiple marginal points
Marginal point with the largest profit in this ordering: 548 sources
Challenge 2. Each source is different in quality, so different orderings give different marginal points: best solution integrates 26 sources
Challenge 3. Estimating gain and cost w/o real integration
CS books from AbeBooks.com
*
*Input
*S: a set of available sources
*F: integration model
*Output: subset Ŝ to maximize profit G_F(Ŝ) - C_F(Ŝ)
*G_F(Ŝ): Gain of integrating Ŝ using model F
*C_F(Ŝ): Cost of integrating Ŝ using model F
*Gain and cost need to be in the same unit to be comparable; e.g., \$
*
*Theorem I (NP-Completeness). Under the arbitrary cost model (i.e., different sources have different costs), Marginalism is NP-complete.
*Theorem II (A greedy solution can obtain arbitrarily bad results). Let d_opt be the optimal profit and d be the profit of a greedy solution. For any θ > 0, there exists an input set of sources and a gain model s.t. d/d_opt < θ.
*
Improvement I. Randomly select from Top-k solutions
Improvement II. Hill climbing to improve the initial solution
Improvement III. Repeat r times and choose the best solution
*
*Side contributions on data fusion
*Adding a source should never decrease fusion quality
*Algorithms to estimate fusion quality: dynamic programming
*
*Book data set: CS books at AbeBooks.com in 2007
*894 sources
*1265 books
*24364 records
*Flight data set: Deep-Web sources for “flight status” in 2011
*38 sources
*1200 flights
*27469 records
*
Marginalism selects 165 sources, reaching the highest quality
228 sources provide books in gold standard
PopAccu outperforms Vote and Accu, and is nearly monotonic for “good” sources
*
Marginalism has higher profit than MaxGLimitC and MinCLimitG most of the time
*
Greedy solution often cannot find the optimal solution
GRASP (top-10, repeating 320 times) obtains nearly optimal results
*
*Full-fledged source selection for data integration
*Other quality measures: e.g., freshness, consistency, redundancy; correlations, copying relationships between sources
*Complex cost and gain models
*Selecting subsets of data from each source
*Other components of data integration: schema mapping, entity resolution
The More the Better?
OR
Less is More?
```
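The three improvements listed in the slides (random choice among the top-k candidates, hill climbing, and r restarts) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the coverage-based gain (one unit of value per distinct item), the per-source costs, and all names (`profit`, `construct`, `hill_climb`, `grasp`, the sources `A` to `D`) are assumptions made up for the example.

```python
import random


def profit(selected, coverage, cost, value_per_item=1.0):
    """Toy profit (an assumption): value of distinct items covered minus total cost."""
    covered = set()
    for s in selected:
        covered |= coverage[s]
    return value_per_item * len(covered) - sum(cost[s] for s in selected)


def construct(sources, score, k, rng):
    """Randomized greedy: repeatedly add one of the top-k profit-improving sources."""
    chosen = set()
    while True:
        base = score(chosen)
        gains = [(score(chosen | {s}) - base, s) for s in sources if s not in chosen]
        improving = sorted((g for g in gains if g[0] > 0), reverse=True)[:k]
        if not improving:
            return chosen
        chosen.add(rng.choice(improving)[1])


def hill_climb(chosen, sources, score):
    """Local search: toggle single sources in or out while profit strictly improves."""
    chosen = set(chosen)
    improved = True
    while improved:
        improved = False
        for s in sources:
            candidate = chosen ^ {s}
            if score(candidate) > score(chosen):
                chosen, improved = candidate, True
    return chosen


def grasp(sources, score, k=2, r=20, seed=0):
    """Repeat construction plus hill climbing r times and keep the best subset."""
    rng = random.Random(seed)
    best = set()
    for _ in range(r):
        solution = hill_climb(construct(sources, score, k, rng), sources, score)
        if score(solution) > score(best):
            best = solution
    return best


# Hypothetical toy input: which items each source provides, and what it costs.
coverage = {"A": {1, 2, 3, 4, 5, 6}, "B": {5, 6, 7, 8}, "C": {1, 2, 3}, "D": {9}}
cost = {"A": 2.0, "B": 1.5, "C": 0.5, "D": 2.0}
sources = sorted(coverage)


def score(selected):
    return profit(selected, coverage, cost)


greedy_only = grasp(sources, score, k=1, r=1)  # k=1, r=1 is plain greedy + hill climbing
randomized = grasp(sources, score, k=2, r=20)
print(sorted(greedy_only), score(greedy_only))
print(sorted(randomized), score(randomized))
```

On this toy input, plain greedy commits to the largest source `A` first and stops at `{A, B}` with profit 4.5, while the subset `{B, C}` has profit 5.0. Randomizing the first pick gives the search a chance to start from `C` and reach that better subset, which is the intuition behind repeating a randomized construction rather than trusting a single greedy pass.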