Trust and Profit Sensitive Ranking for Web Databases and On

Report
Trust and Profit Sensitive Ranking for Web
Databases and On-line Advertisements
Raju Balakrishnan
(PhD Proposal Defense)
Committee: Subbarao Kambhampati (chair)
Yi Chen
AnHai Doan
Huan Liu.
Agenda
 Problem 1: Ranking the Deep Web
– Need for New Ranking.
– SourceRank: Agreement Analysis.
– Computing Agreement and Collusion .
– Results & System Implementation.
 Proposed Work: Ranking the Deep Web Results.
 Problem 2: Ad-Ranking sensitive to Mutual Influences.
– Browsing model & Nature of Influence.
– Ranking Function and Generalizations.
– Results.
 Proposed Work: Mechanism Design for Ads.
Deep Web Integration Scenario
Millions of
sources
containing
structured
tuples
Uncontrolled
collection of
redundant
information
Mediator
Search engines
have nominal
access. We
don’t Google for
a “Honda Civic
2008 Tempe”
Web DB
Web DB
Web DB
Web DB
Web DB
Deep Web
Why Another Ranking?
Rankings are oblivious to result Importance & Trustworthiness
Example Query: “Godfather Trilogy” on Google Base
Importance: Searching for titles
matching with the query. None of
the results are the classic Godfather
Trustworthiness (bait and switch)
The titles and cover image match
exactly.
Prices are low. Amazing deal!
But when you proceed towards
check out you realize that the product
is a different one! (or when you open
the mail package, if you are really
unlucky)
Source Selection in the Deep Web
Problem: Given a user query, select a subset of sources
to provide important and trustworthy answers.
Surface web search combines link analysis with
Query-Relevance to consider trustworthiness and
relevance of the results.
Unfortunately, deep web records do not have the
hyper-links.
Source Agreement
Observations
 Many sources return answers to the same query.
 Comparison of semantics of the answers is
facilitated by structure of the tuples
Idea: Compute importance and trustworthiness
of sources based on the agreement of answers
returned by different sources.
Agreement Implies Trust & Importance.
Important results are likely to be returned by
a large number of sources.
e.g. For the query “Godfather” hundreds of
sources return the classic “The Godfather”
while a few sources return the little known
movie “Little Godfather”.
Two independent sources are not likely to
agree upon corrupt/untrustworthy answers.
e.g. The wrong author of the book (e.g.
Godfather author as “Nino Rota”) would not
be agreed by other sources. As we know,
truth is one (or a few), but lies are many.
Agreement is not just for the search
Which tire?
Agreement Implies Trust & Relevance
Probability of agreement of two
independently selected
irrelevant/false tuples is
1
1
1

P (f , f )
|U |
3
100 k
a
1
2
Probability of agreement or two
independently picked relevant
and true tuples is
Pa (r1, r 2) 
1
| RT |
| U || RT | Pa(r1, r 2)  Pa( f 1, f 2)
Method: Sampling based Agreement
W ( S 1  S 2)    (1   ) 
0.78
S3
0.22
where  induces the smoothing links
to account for the unseen samples.
R1, R2 are the result sets of S1, S2 .
S1
0.86
0.4
0.14
A( R1, R 2)
| R2 |
0.6
S2
Link semantics from Si to Sj with
weight w: Si acknowledges w
fraction of tuples in Sj. Since weight
is
the
fraction,
links
are
unsymmetrical.
Agreement is computed
using 200 key word
queries.
Partial titles of
movies/books are used as
queries.
Mean agreement over all
the queries are used as the
final agreement.
Method: Calculating SourceRank
How can I use the
agreement graph for
improved search?
• Source graph is viewed as a
markov chain, with edges as the
transition probabilities between
the sources.
• The
prestige
of
sources
considering transitive nature of
the
agreement
may
be
computed based on a markov
random walk.
SourceRank is equal to this stationary visit probability of
the random walk on the database vertex.
This static SourceRank may be combined with a queryspecific source-relevance measure for the final ranking.
Computing Agreement is Hard
Computing semantic agreement between two records is
the record linkage problem, and is known to be hard.
Godfather, The: The
Coppola Restoration
James Caan /
Marlon Brando more
The Godfather - The
Coppola Restoration
Giftset [Blu-ray]
Marlon Brando, Al
Pacino
Example
“Godfather”
tuples from two
web sources.
Note that titles
and castings are
denoted
differently.
Semantically same entities may be represented syntactically
differently by two databases (non-common domains).
Method: Computing Agreement
Agreement Computation has Three levels.
1. Comparing Attribute-Value
 Soft-TFIDF with Jaro-Winkler as the similarity measure is used.
2. Comparing Records.
 We do not assume predefined schema matching.
Instance of a bipartite
matching problem.
Optimal matching is O(v3 ).
 O(v ) Greedy matching is used. Values are greedily matched
against most similar value in the other record.
The attribute importance are weighted by IDF. (e.g. same titles
(Godfather) is more important than same format (paperback))
2
3. Comparing result sets.
 Using the record similarity computed above, result set similarities
are computed using the same greedy approach.
Detecting Source Collusion
The sources may copy data
from each other, or make
mirrors, boosting
SourceRank of the group.
Observation 1: Even non-colluding sources in the same domain may
contain same data.
e.g. Movie databases may contain all Hollywood movies.
Observation 2: Top-k answers of even non-colluding sources may be
similar.
e.g. Answers to query “Godfather” may contain all the three movies in
the Godfather trilogy.
Source Collusion--Continued
Basic Method: If two sources return same top-k
answers to the queries with large number of
answers (e.g. queries like “the” or “DVD”) they
are likely to be colluding.
We compute the degree of collusion of sources
as the agreement on large answer queries.
Words with highest DF in the crawl is used as
the queries.
The agreement between two databases are
adjusted for collusion by multiplying by
(1-collusion).
Factal: Search based on SourceRank
”I personally ran a handful of
test queries this way and got
much better [than Google
Products] results using Factal” -- Anonymous WWW’11 Reviewer.
http://factal.eas.asu.edu
Evaluation
Precision and DCG are compared with the following baseline
methods
1) CORI: Adapted from text database selection. Union of
sample documents from sources are indexed and
sources with highest number term hits are selected
[Callan et al. 1995].
2) Coverage: Adapted from relational databases. Mean
relevance of the top-5 results to the sampling queries
[Nie et al. 2004].
3) Google Products: Products Search that is used over
Google Base
All experiments distinguish the SourceRank from baseline
methods with 0.95 confidence levels.
Though combinations are not our competitors,
note that they are not better:
1.SourceRank implicitly considers query
relevance, as selected sources fetch answers
Precision
by query similarity. Combining again
with query
DCG
similarity may be an “overweighting”
2. Search is Vertical
Online Top-4 Sources-Movies
0.45
0.4
29%
0.35
0.3
0.25
0.2
0.15
0.1
0.05
0
Coverage
SourceRank
CORI
SR-Coverage
18
SR-CORI
Online Top-4 Sources-Books
0.35
Precision
DCG
0.3
48%
0.25
0.2
0.15
0.1
0.05
0
Coverage
SourceRank
CORI
SR-Coverage
19
SR-CORI
Google Base Top-5 Precision-Books
0.5
675 Sources
0.4
0.3
0.2
0.1
0
24%
 675 Google Base
sources responding
to a set of book
queries are used as
the book domain
sources.
GBase-Domain is
the Google Base
searching only on
these 675 domain
sources.
Source Selection
by SourceRank
(coverage) followed
by ranking by Google
Base.
20
Google Base Top-5 Precision-Movies
0.25
209 Sources
25%
0.2
0.15
0.1
0.05
0
Gbase
Gbase-Domain
SourceRank
Coverage
21
Trustworthiness of Source Selection
1. Corrupted the results in sample
crawl by replacing attribute vales
not specified in the queries with
random strings (since partial
titles are the queries, we
corrupted attributes except
titles).
2.If the source selection is
sensitive to corruption, the ranks
should decrease with the
corruption levels.
Google Base Movies
60
SourceRank
Coverage
CORI
Decrease in Rank(%) 
50
40
30
20
10
0
-10
0
0.1
0.2
0.3
0.4
0.5
0.6
Corruption Level 
0.7
0.8
0.9
Every relevance measure
based on query-similarity are
oblivious to the corruption of
attributes unspecified in queries.
22
Trustworthiness- Google Base Books
45
SourceRank
Coverage
CORI
Decrease in Rank(%) 
40
35
30
25
20
15
10
5
0
-5
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
Corruption Level 
23
0.8
0.9
Collusion—Ablation Study
1
0.8
Collusion
Agreement
Adjusted Agreement
0.6
0.4
0.2
0
1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0
Rank Correlation 
Two database with the
same one million tuples
from IMDB are created.
Correlation between the
ranking functions reduced
increasingly.
Observations:
1. At high correlation the
adjusted agreement is
very low.
2. Adjusted agreement is
almost the same as the
pure agreement at low
correlations.
Natural agreement will be
preserved while catching
near-mirrors.
24
Computation Time
Random walk is
known to be feasible in
large scale.
Time to compute the
agreements is
evaluated against
number of sources.
Note that the
computation is offline.
Easy to parallelize.
25
Publications and Recognition
SourceRank:Relevance and Trust Assessment for
Deep Web Sources Based on Inter-Source
Agreement.
Raju Balakrishnan, Subbarao Kambhampati.
Accepted for WWW 2011 (Full Paper).
Factal: Integrating Deep Web Based on Trust and
Relevance.
Raju Balakrishnan, Subbarao Kabmbhampati.
Accepted for WWW 2011 (Demonstration).
SourceRank:Relevance and Trust Assessment for
Deep Web Sources Based on Inter-Source
Agreement (Best Poster Award, WWW 2010).
Raju Balakrishnan, Subbarao Kambhampati.
WWW 2010 Pages 1055~1056.
Proposed Work: Ranking the Deep Web Results
Results from the selected sources need to be
combined and ranked. Ranking is online and
hence time critical. We plan to:
 Consider Higher order agreements.
e.g. Second order agreement, i.e. Friends of two tuples are
agreeing to each other.
 Computing and combining query-similarity.
Query similarity considering structure of tuples.
Positional voting, as sources rank tuples.
e.g. Borda count instead of set based agreement used currently.
Ranking Deep Web Results---Continued
Learning to combine ranking Parameters.
Multiple Parameters: agreement, lineage, query-similarity higher
order agreements etc. need to be combined.
Considering Diversity Ranking.
Incorporating diversity in the result level and in the source level.
 Large Scale Evaluation of Result Ranking.
Similar to source selection, the result ranking need to be evaluated
on multiple large scale datasets.
 Enhancing prototype.
Factal prototype may be enhanced with the result ranking.
Agenda
 Ranking the Deep Web
Search
engines
generate their multi– Need for
new ranking.
billion
dollar revenue
byAnalysis.
textual ads.
– SourceRank:
Agreement
Related
problem
of ranking
of ads. is
– Computing
Agreement
and Collusion
as important
the ranking of
– Results
& Systemas
Implementation.
results.
 Proposed Work:
Ranking the Deep Web Results.
 Problem 2: Ad-Ranking sensitive to Mutual
Influences.
A different
– Motivation and Problem Definition.
aspect of
– Browsing model & Nature of Influence ranking
– Ranking Function & Generalization
– Results.
 Proposed Work: Mechanism Design & Evaluations.
Ad Ranking: State of the Art
Sort by
Sort by
Bid Amount x Relevance
Bid Amount
Ads are Considered in Isolation, as both ignore Mutual
influences.
We consider ads as a set, and ranking is based
on user’s browsing model
User’s
Browsing
Model
If a2 is similar to a1
residual relevance of a2
•User browses down staring at the first
decreases and
abandonment
probability increases.
ad
• At every ad he May
Click the ad with relevance probability
R(a)  P(click(a) | view(a))
Abandon browsing with probability
Goes down to the next ad with probability
Process repeats for the
ads below with a reduced
probability
Mutual Influences
Three Manifestations of Mutual Influences on an ad
a are:
1. Similar ads placed above a
 Reduces user’s residual relevance of
a
a
 User may click on above ads may not view a
2. Relevance of other ads placed above
3. Abandonment probability of other ads placed
above a
 User may abandon search and may not view a
Expected Profit Considering Ad Similarities
Considering bids ( $(ai )), residual Relevance ( R ( ai )),
abandonment probability (  ( ai ) ), and similarities the expected
profit from a set of n results is,
i 1
n
Expected Profit =
 $(a )R (a ) 1  R (a )   (a )
i
i 1
r
i
r
j
j
j 1
THEOREM: Ranking maximizing expected profit considering
similarities between the results is NP-Hard
Even worse, constant ratio approximation algorithms are hard
(unless NP = ZPP) for diversity ranking problem
Proof is a reduction of independent set problem to choosing top-k ads
considering similarities.
Expected Profit Considering other two Mutual
Influences (2 and 3)
Dropping similarity, hence replacing Residual Relevance
( Rr (ai )) by Absolute Relevance ( R ( ai )),
i 1
n
Expected Profit =
 $(a )R(a )1  R(a )   (a )
i
i 1
i
j
j
j 1
Ranking to maximize this expected utility is a
sorting problem
Optimal Ranking
Rank ads in the descending order of:
$(a) R(a)
RF (a) 
R(a)   (a)
The physical meaning RF is the profit generated for
unit consumed view probability of ads
Higher ads have more view probability. Placing ads
producing more profit for unit consumed view
probability higher up is intuitive.
Comparison to current Ad Rankings
Sort by Bid Amount
Bid Amount x Relevance
 Assume abandonment
probability is zero
Assume  (a)  k  R(a)
 (a)  0
where
k
is a constant for all ads
$(a) R(a)
RF (a) 
 $(a)
R(a)
$( a ) R(a )
RF (a ) 
 $(a) R(a)
k
Assumes that the user
has infinite patience to
go down the results until
he finds the ad he wants.
Assumes that
abandonment probability
is negatively proportional
to relevance.
Generality of the Proposed Ranking
The generalized
ranking based on
utilities.
For documents
utility=relevance
For ads utility=bid
amount
Popular relevance
ranking
Quantifying Expected Profit
40
RF
Bid Amount x Relevance
Bid Amount
35
Expected Profit 
30
Abandonment
probability
Bid amount only
strategy
becomes
optimal at  (a)  0
Uniform Random as
0   (a)  
Relevance
Difference in profit
between RF and
competing strategy
is significant
25
20
Uniform random as
  R(a )  1
Number of Clicks
Zipf random with
exponent 1.5
15
35.9%
Proposed strategy
gives maximum
Bid Amounts
profit for the entire Uniform random
range
45.7%
10
5
0  $(a)  10
0
0
0.1
0.2
0.3
0.4
0.5

0.6
0.7
0.8
0.9
1
Publication and Recognition
Optimal Ad-Ranking for
Profit Maximization.
Raju Balakrishnan, Subbarao
Kabmbhampati.
WebDB 2008
Yahoo! Research Key
scientific Challenge award
for Computation
advertising, 2009-10
Proposed Work: Mechanism Design and Analysis
1. Ad-Auction based on the proposed ranking
Associate a pricing mechanism (e.g. Generalized Second
Pricing) wit the proposed ranking to formulate a complete
auction mechanism.
 Formulating an envy free equilibrium
At envy free equilibria, no advertiser will be able to
increase the profit by changing bids, and the auction
becomes stable.
 Analysis of advertiser’s profit and comparison
with the existing mechanisms.
The profits at equilibrium is likely to be the sustained
profit.
Proposed Work: Mining Ad Ranking Parameters
Evaluating the proposed ranking on Click Logs
 In addition to currently used CTR and bid amounts, the
proposed ranking requires the additional parameter
abandonment probability.
 Mining/Learning abandonment probability from click logs,
and using to improve ranking will ensure the applicability
of the proposed method.
 Since the access to the ad click logs is restricted within
search providers, I am hoping to perform this work
as part of a possible summer internship (this is a risk
factor).
Contributions (completed & proposed)
1. SourceRank based source selection sensitive to
trustworthiness and importance for the deep web.
2. A wisdom of sources approach to rank deep web
results considering trust and relevance.
3. An optimal generalized ranking for ads and search
results.
4. A ranking framework optimal with respect to the
perceived relevance of search snippets, and
abandonment probability.
5. A complete auction mechanism extending the
optimal ranking with pricing.

similar documents