Report

Please interrupt me at any point! Online Advertising Open lecture at Warsaw University February 25/26, 2011 Ingmar Weber Yahoo! Research Barcelona [email protected] Disclaimers & Acknowledgments • This talk presents the opinions of the author. It does not necessarily reflect the views of Yahoo! Inc. or any other entity. • Algorithms, techniques, features, etc. mentioned here might or might not be in use by Yahoo! or any other company. • Many of the slides in this lecture are based on tables/graphs from the referenced papers. Please see the actual papers for more details. Review from last lecture • Lots of money – Ads essentially pay for the WWW • Mostly sponsored search and display ads – Sp. search: sold using variants of GSP – Disp. ads: sold in GD contracts or on the spot • Many computational challenges – Finding relevant ads, predicting CTRs, new/tail content and queries, detecting fraud, … Plan for today and tomorrow • So far – Mostly introductory, “text book material” • Now – Mostly recent research papers – Crash course in machine learning, information retrieval, economics, … Hopefully more “think-along” (not sing-along) and not “shut-up-and-listen” But first … • Third party cookies www.bluekai.com (many others …) Efficient Online Ad Serving in a Display Advertising Exchange Keving Lang, Joaquin Delgado, Dongming Jiang, et al. WSDM’11 Not so simple landscape for D A Advertisers “Buy shoes at nike.com” “Visit asics.com today” “Rolex is great.” Publishers A running blog The legend of Cliff Young Celebrity gossip Users 32m, likes running 50f, loves watches 16m, likes sports Basic problem: Given a (user, publisher) pair, find a good ad(vertiser) Ad networks and Exchanges • Ad networks – Bring together supply (publishers) and demand (advertisers) – Have bilateral agreements via revenue sharing to increase market fluidity • Exchanges – Do the actual real-time allocation – Implement the bilateral agreements Middle-aged, middle-income New Yorker visits the web site of Cigar Magazine (P1) D only known at end. User constraints: no alcohol ads to minors Supply constraints: conservative network doesn’t want left publishers Demand constraints: Premium blogs don’t want spammy ads Valid Paths & Objective Function Depth-first search enumeration Algorithm A Worst case running time? Typical running time? Algorithm B US pruning Upper bound Why? Worst case running time? Sum vs. product? Optimizations? D pruning Reusable Precomputation Cannot fully enforce D Depends on reachable sink … … which depends on U What if space limitations? How would you prioritize? Experiments – artificial data Experiments – real data Competing for Users’ Attention: On the Interplay between Organic and Sponsored Search Results Christian Danescu-NiculescuMizil, Andrei Broder, et al. WWW’10 What would you investigate? What would you suspect? Things to look at • General bias for near-identical things – Ads are preferred (as further “North”) – Organic results are preferred • Interplay between ad CTR and result CTR – Better search results, less ad clicks? – Mutually reinforcing? • Dependence on type – Navigational query vs. informational query – Responsive ad vs. incidental ad Data • One month of traffic for subset of Y! search servers • Only North ads, served at least 50 times • For each query qi most clicked ad Ai* and most clicked organic result Oi* • 63,789 (qi, Oi*, Ai*) triples • Bias? (Non-)Commercial bias? • Look at A* and O* with identical domain • Probably similar quality … • … but (North) ad is higher • What do you think? • In 52% ctrO > ctrA av. ctrO av. ctrA Correlation ctrO For given (range of) ctrO bucket all ads. ctrA av. ctrO av. ctrA Navigational vs. non-navigational ctrO ctrA Navigational: antagonistic effect Non-navigational: (mild) reinforcement Dependence on similarity Bag of words for title terms (“Free Radio”, “Pandora Radio – Listen to Free Internet Radio, Find New Music”) = 2/9 av. ctrA av. ctrA Dependence on similarity A simple model Want to model Also need: A simple model Explains basic (quadratic) shape of overlap vs. ad click-through-rate Improving Ad Relevance in Sponsored Search Dustin Hillard, Stefan Schroedl, Eren Manavoglu, et al. WSDM’10 Ad relevance Ad attractiveness • Relevance – How related is the ad to the search query – q=“cocacola”, ad=“Buy Coke Online” • Attractiveness – Essentially click-through rate – q=“cocacola”, ad=“Coca Cola Company Job” – q=*, ad=“Lose weight fast and easy” Hope: decoupling leads to better (cold-start) CTR predictions Basic setup • Get relevance from editorial judgments – Perfect, excellent, good, fair, bad – Treat non-bad as relevant • Machine learning approach – Compare query to the ad – Title, description, display URL – Word overlap (uni- and bigram), character overlap (uni- and bigram), cosine similarity, ordered bigram overlap – Query length • Data – 7k unique queries (stratified sample) – 80k query-ad judged relevant pairs Basic results – text only Precision = (“said ‘yes’ and was ‘yes’”)/(“said ‘yes’”) Recall = (“said ‘yes’ and was ‘yes’”)/(“was ‘yes’”) Accuracy = (“said the right thing”)/(“said something”) F1-score = 2/(1/P + 1/R) harmonic mean < arithmetic mean What other features? Incorporating user clicks • Can use historic CTRs – Assumes (ad,query) pair has been seen • Useless for new ads – Also evaluate in blanked-out setting Translation Model In search, translation models are common Here D = ad Good translation = ad click Typical model A query term Maximum likelihood (for historic data) Any problem with this? An ad term Digression on MLE • Maximum likelihood estimator – Pick the parameter that‘s most likely to generate the observed data Example: Draw a single number from a hat with numbers {1, …, n}. You observe 7. Maximum likelihood estimator? Underestimates size (c.f. # of species) Underestimates unknown/impossible Unbiased estimator? Remove position bias • Train one model as described before – But with smoothing • Train a second model using expected clicks • Ratio of model for actual and expected clicks • Add these as additional features for the learner Filtering low quality ads • Use to remove irrelevant ads - Don‘t show ads below relevance threshold Showing fewer ads gave more clicks per search! Second part of Part 2 Estimating Advertisability of Tail Queries for Sponsored Search Sandeep Pandey, Kunal Punera, Marcus Fontoura, et al. SIGIR’10 Two important questions • Query advertisability – When to show ads at all – How many ads to show • Ad relevance and clickability – Which ads to show – Which ads to show where Focus on first problem. Predict: will there be an ad click? Difficult for tail queries! Word-based Model Query q has words {wi}. Model q‘s click propensity as: Good/bad? Variant w/o bias for long queries: Maximum likelihood attempt to learn these: s(q) = # instances of q with an ad click n(q) = # instances of q without an ad click Word-based Model Then give up …each q only one word Linear regression model Different model: words contribute linearly Add regularization to avoid overfitting of underdetermined problem Problem? Digression Taken from: http://www.dtreg.com/svm.htm and http://www.teco.edu/~albrecht/neuro/html/node10.html Topical clustering • Latent Dirichlet Allocation – Implicitly uses co-occurrences patterns • Incorporate the topic distributions as features in the regression model Evaluation • Why not use the observed c(q) directly? – “Ground truth” is not trustworthy – tail queries • Sort things by predicted c(q) – Should have included optimal ordering! Learning Website Hierarchies for Keyword Enrichment in Contextual Advertising Pavan Kumar GM, Krishna Leela, Mehul Parsana, Sachin Garg WSDM’11 The problem(s) • Keywords extracted for contextual advertising are not always perfect • Many pages are not indexed – no keywords available. Still have to serve ads • Want a system that for a given URL (indexed or not) outputs good keywords • Key observation: use in-site similarity between pages and content Preliminaries • Mapping URLs u to key-value pairs • Represent webpage p as vector of keywords – tf, df, and section where found Goals: 1. Use u to introduce new kw and/or update existing weights 2. For unindexed pages get kw via other pages from same site Latency constraint! What they do • Conceptually: – Train a decision tree with keys K as attribute labels, V as attribute values and pages P as class labels – Too many classes (sparseness, efficiency) • What they do: – Use clusters of web pages as labels Digression: Large scale clustering Syntactic clustering of the Web, Broder et al., 1997 • How (and why) to detect mirror pages? – “ls man” What would you do? • Want a summarizing “fingerprint”? – Direct hashing won’t work Shingling w-shingles of a document (say, w=4) – “If you are lonely when you are alone, you are in bad company.” (Sartre) {(if you are lonely), (you are lonely when), (are lonely when you), (lonely when you are), …} Resemblance rw(A,B) = |S(A,w)ÅS(B,w)|/|S(A,w)[S(B,w)| Works well, but how to compute efficiently?! Obtaining a “sketch” • • • • Fix shingle size w, shingle universe U. Each indvidual shingle is a number (by hashing) Let W be a set of shingles. Define: MINs(W) = The set of s smallest elements in W, if |W|¸s W otherwise Theorem: Let : U!U be a permutation of U chosen uniformly at random. Let M(A) = MINs((S(A))) and M(B) = MINs((S(B))). The value |MINs(M(A)[M(B))ÅM(A)ÅM(B)|/|MINs(M(A)[M(B))| is an unbiased estimate of the resemblance of A and B. Proof Note: Mins(M(A)) has a fixed size (namely s). Back to where we were • They (essentially) use agglomerative single-linkage clustering with a min similarity stopping threshold Do you know agglomerative clustering? • Splitting criteria – How would you do it? Not the best criterion? • IG prefers attributes with many values – They claim: high generalization error – They use: Gain Ratio (GR) Take impressions into account • So far (unweighted) pages – Class probability = number of pages Weight things by impressions. More weight for recent visits: Stopping criterion • Stop splitting in tree construction when – All children part of the same class – Too few impressions under the node – Statistically not meaningful (Chi-square test) • Now we have a decision tree for URLs (leaves) – What about interior nodes? Obtaining keywords for nodes • Belief propagation – from leaves up …and back down down Now we have keywords for nodes. Keywords for matching nodes are used. Evaluation • Two state-of-the-art baselines – Both use the content – JIT uses only first 500 bytes, syntactical – “Semantic” uses topical page hierarchies – All used with cosine similarity to find ads • Relevance evaluation – Human judges evaluated ad relevance (Some) Results nDCG … slide Digression - nDCG • Normalized Discounted Cumulative Gain • CG: total relevance at positions 1 to p • DCG: the higher the better • nDCG: take problem difficulty into account An Expressive Mechanism for Auctions on the Web Paul Dütting, Monika Henzinger, Ingmar Weber WWW’11 More general utility functions • Usually – ui,j(pj) = vi,j – pj – Sometimes with (hard) budget bi • We want to allow – ui,j(pj) = vi,j – ci,j¢ pj, i.e. (i,j)-dependent slopes – multiple slopes on different intervals – non-linear utilities altogether Why (i,j)-dependent slopes? • Suppose mechanism uses CPC pricing … • … but a bidder has CPM valuation • Mechanism computes • Guarantees Why (i,j)-dependent slopes? • Translating back to impressions … Why different slopes over intervals? • Suppose bidding on a car on ebay – Currently only 1-at-a-time (or dangerous)! – Utility depends on rates of loan Why non-linear utilities? • Suppose the drop is supra-linear – The higher the price the lower the profit … – … and the higher the uncertainty – Maybe log(Ci,j-pj) – “Risk-averse” bidders • Will use piece-wise linear for approximation – Give approximation guarantees Input definition Set of n bidder I, set of k items J. Items contain a dummy item j0. Each bidder i has an outside option oi. Each item j has a reserve price rj. Problem statement • Compute an outcome • Outcome is feasible if • Outcome is envy free if for all i and (i,j) 2IxJ • Bidder optimal if for all other envy free and for all bidders i (strong!) Bidder optimality vs. truthfulness Two bidders i2{1,2} and two items j2{1,2}. rj = 0 for j2{1,2}, and oi = 0 for i2{1,2} What‘s a bidder optimal outcome? What if bidder 1 underreports u1,1(¢)? Note: “degenerate” input! Theorem: General position => Truthfulness. [See paper for definition of “general position”.] Main Results Definition: Overdemand-preserving directions • Basic idea – Algorithm iteratively increases the prices – Price increases required to solve overdemand • Tricky bits – preserve overdemand (will explain) – show necessity (for bidder optimality) – accounting for unmatching (for running time) Overdemand-preserving directions The simple case 1 10-p1 5-p3 Bidders 2 5 p1=1 9-p2 8-p1 7-p2 3-p3 Explain: First choice graph 1 2 p2=0 4 Items Explain: Increase required 12-p1 11-p2 3 2-p3 3 p3=0 Explain: Path augmentation Overdemand-preserving directions The not-so-simple case 1 11-2p1 5-p3 Bidders 2 1 p1=1 9-p2 8-4p1 4-3p2 2 p2=0 3-p3 Explain: ci,j matter! 12-3p1 9-7p2 3 2-p3 3 p3=0 Items Finding ov.d-preserv. directions • Key observation (not ours!): – minimize – or equivalently • No longer preserves full first choice graph – But alternating tree • Still allows path augmention The actual mechanism Effects of Word-of-Mouth Versus Traditional Marketing: Findings from an Internet Social Networking Site Michael Trusov, Randolph Bucklin, Koen Pauwels Journal of Marketing, 2009 The growth of a social network • Driving factors – Paid event marketing (101 events in 36 wks) – Media attention (236 in 36 wks) – Word-of-Mouth (WOM) • Can observe – Organized marketing events – Mentions in the media – WOM referrals (through email invites) – Number of new sign-ups What could cause what? • • • • Media coverage => new sign-ups? New sign-ups => media coverage? WOM referrals => new sign-ups? …. Time series modeling intercept sign-ups WOM referrals Media appearances Promo events Up to 20 days Lots of parameters linear trend holidays day of week Time series modeling Overfitting? Granger Causality • Correlation causality – Regions with more storks have more babies – Families with more TVs live longer • Granger causality attempts more – Works for time series – Y and (possible) cause X – First, explain (= linear regression) Y by lagged Y – Explain the rest using lagged X – Significant improvement in fit? What causes what? Response of Sign-Ups to Shock • IRF: impulse response function New to me … Digression: Bass diffusion New “sales” at time t: Ultimate market potential m is given. Model comparison • 197 train (= in-sample) • 61 test (= out-of-sample) Monetary Value of WOM • • • • • CPM about $.40 (per ad) Impressions visitor/month about 130 Say 2.5 ads per impression $.13 per month per user, or about $1.50/yr IRF: 10 WOM = 5 new sign-ups over 3 wk 1 WOM worth approx $.75/yr