Report

Paper: "Structured Annotations of Web Queries", N. Sarkas et al., SIGMOD 2010

Outline: Motivation, Approach, Query Processing, Probabilistic Model

Motivation

Do keyword search on a structured database
◦ The user leverages experience with web search engines to search for records in a database

Problems
◦ Plain keyword matching may not return any result
◦ The user's intent may be misinterpreted
  Q = "white tiger" could mean:
  T=Shoes, Color=white, Shoe Line=tiger
  T=Books, Title=white tiger
  the animal white tiger (not in the database)
◦ The system must respond fast (in milliseconds) for web users; naively querying every database is not efficient

Annotation

◦ An annotation token for a table, AT = (t, T.A), maps a query token t to attribute A of table T
  Q = "50 inch LG" → (LG, TVs.Brand), (LG, Monitors.Brand), (50 inch, TVs.Diagonal)
◦ All tokens (nominal, ordinal, and numerical) appearing in the table are annotated; numerical tokens are accepted within a certain range

  TVs:                          Monitors:
  Type  Brand    Diagonal       Type     Brand    Diagonal
  TV    Samsung  46 inch        Monitor  Samsung  19 inch
  TV    LG       50 inch        Monitor  LG       24 inch

Query Processing

1. Generate the structured annotations for the keywords
2. Find the maximal annotations
3. Score the annotations

Generate structured annotations for a query
◦ For keywords k1, k2, an annotation has the form (T, {(k1, T.Att1), …}, {free tokens})
◦ A free token is a query keyword not associated with any attribute
◦ Example, "50 inch LG lcd":
  (TVs, {(50 inch, TVs.Diagonal), (LG, TVs.Brand), (lcd, TVs.Screen)}, {})
  (Monitors, {(50 inch, Monitors.Diagonal), (LG, Monitors.Brand), (lcd, Monitors.Screen)}, {})
  (Refrigerators, {(50 inch, Refrigerators.Width), (LG, Refrigerators.Brand)}, {lcd})

Find the maximal annotations
◦ Given a table, we prefer more annotated tokens and fewer free tokens
◦ An annotation S = (T, AT, FT) is maximal if there is no S' = (T, AT', FT') s.t.
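As a rough illustration of the first two steps (generating structured annotations for a query and keeping only the maximal ones), here is a minimal Python sketch. The token index, table names, and function names are hypothetical stand-ins, not taken from the paper:

```python
# Hypothetical in-memory annotator: a toy index maps each query token to the
# (table, attribute) pairs it can annotate, then one structured annotation
# (table, annotated tokens, free tokens) is built per candidate table.

TOKEN_INDEX = {
    "lg": [("TVs", "Brand"), ("Monitors", "Brand"), ("Refrigerators", "Brand")],
    "50 inch": [("TVs", "Diagonal"), ("Monitors", "Diagonal"),
                ("Refrigerators", "Width")],
    "lcd": [("TVs", "Screen"), ("Monitors", "Screen")],
}

def annotate(tokens):
    """Return a list of (table, annotated_tokens, free_tokens) triples."""
    tables = {t for tok in tokens for t, _ in TOKEN_INDEX.get(tok, [])}
    annotations = []
    for table in sorted(tables):
        at, ft = [], []
        for tok in tokens:
            hits = [(tok, f"{table}.{a}")
                    for t, a in TOKEN_INDEX.get(tok, []) if t == table]
            if hits:
                at.extend(hits)
            else:
                ft.append(tok)  # not attributable in this table -> free token
        annotations.append((table, at, ft))
    return annotations

def is_maximal(s, others):
    """S is maximal if no S' over the same table annotates a strict superset
    of its tokens (and hence leaves a strict subset of free tokens)."""
    table, at, ft = s
    return not any(t2 == table and set(at) < set(at2)
                   for t2, at2, ft2 in others)
```

For the query "50 inch LG lcd" this produces exactly the three candidate annotations listed above, with "lcd" left as a free token for Refrigerators.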
AT' ⊃ AT and FT' ⊂ FT (AT: annotated tokens, FT: free tokens)
◦ Example
  S1 = (TVs, {(LG, TVs.Brand), (lcd, TVs.Screen)}, {})
  S2 = (TVs, {(LG, TVs.Brand)}, {lcd})
  S3 = (TVs, {(50 inch, TVs.Diagonal), (LG, TVs.Brand)}, {lcd})
  S2 is not maximal: S1 annotates a strict superset of its tokens

Scoring annotations
◦ Intuition for the query "LG 30 inch screen":
  Want: TVs, Monitors
  Dislike DVD Players: there is no DVD player with a screen in the database, and people do not query the size of a DVD player
  Dislike Cell Phones: the screen sizes in the database are significantly smaller
◦ A probabilistic model is chosen for the scoring

Generative probabilistic model
◦ If the user searches a table, which words would the user use, and with what probability?
◦ P(T.Â): the probability that the user searches table T using the attribute subset Â, where Â is the chosen attributes of table T plus a special free-token attribute
◦ Given the attributes, the user selects tokens with probability

  P(S) = P(AT_q, FT_q | T.Â) · P(T.Â)

◦ Example, "LG 30 inch screen":

  P(S) = P({(LG, TVs.Brand), (30 inch, TVs.Diagonal)}, {screen} | TVs.Â) · P(TVs.Â)

◦ The equation needs to be simplified

Simplifying assumptions

Assumption 1: annotated and free tokens are independent

  P(AT_q, FT_q | T.Â) = P(AT_q | T.Â) · P(FT_q | T.Â)

Assumption 2: the user chooses free tokens depending only on the table, and chooses annotated tokens depending only on the attributes of the table (not on the free tokens)

  P(FT_q | T.Â) = P(FT_q | T)
  P(AT_q | T.Â) = P(AT_q | T.A)

Together these give

  P(S_i) = P(AT_q | T.A) · P(FT_q | T) · P(T.Â) ……(2)

where S_q = {S_1, …, S_k} are the annotations of query q.

Open Language Model
Equation (2) assumes that every query targets some table in the data collection.
◦ Not true. Example: Q = "green apple" can be annotated as green = Color, apple = Brand, but could also mean the fruit.
◦ Approach: add an Open Language Model (OLM) "table" that captures open-world queries (e.g., built from the Bing query log)
  S_q = {S_1, …, S_k, S_{k+1}}, where S_{k+1} = S_OLM = (OLM, {FT_q})

  P(S_OLM) = P(FT_q | OLM) · P(OLM) ……(3)

  P(OLM) >> 0, so only plausible annotations survive (an annotation must outscore the OLM)

Summary so far
We have two probabilistic models:
◦ P(S_i) = P(AT_q | T.A) · P(FT_q | T) · P(T.Â) ……(2)
◦ P(S_OLM) = P(FT_q | OLM) · P(OLM) ……(3)
What's next?
◦ Maximize the probability, simplifying the equations where necessary
◦ Build a system based on the model
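Equations (2) and (3) can be sketched as two small scoring functions. All probability values below are invented placeholders for illustration, not values from the paper:

```python
# Toy illustration of the two scoring equations. P_TABLE_ATTRS plays the
# role of the learned priors P(T.A-hat); P_OLM is the prior P(OLM).
# Every number here is a made-up placeholder.

P_TABLE_ATTRS = {
    ("TVs", ("Brand", "Diagonal", "Screen")): 0.20,
    ("Refrigerators", ("Brand", "Width")): 0.05,
}
P_OLM = 0.40  # large prior mass reserved for open-world (free-text) queries

def score_annotation(p_at_given_attrs, p_ft_given_table, table_attrs):
    """Equation (2): P(S) = P(AT_q | T.A) * P(FT_q | T) * P(T.A-hat)."""
    return p_at_given_attrs * p_ft_given_table * P_TABLE_ATTRS[table_attrs]

def score_olm(p_ft_given_olm):
    """Equation (3): P(S_OLM) = P(FT_q | OLM) * P(OLM)."""
    return p_ft_given_olm * P_OLM
```

An annotation would then be kept only if its score under (2) exceeds the open-world score under (3), which is how a large P(OLM) filters out implausible annotations.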
Parameter estimation

Consider a query q from the web log. It is either formulated from an annotation or is a free-text query:

  P(q) = Σ_{S_i ∈ S_q} P(AT_i | T.A) · P(FT_i | T) · P(T.Â_i) + P(FT_q | OLM) · P(OLM)

Free-token probabilities are smoothed with the open language model:

  P(w | T) = λ · P(w | UM_T) + (1 − λ) · P(w | OLM)

where UM_T is the set of all possible names and values that can be associated with table T, and λ is a confidence level. Example: for the free token "computer", T1 = Monitors is more plausible than T2 = TVs.

Maximum likelihood estimation
◦ Given
  Observed data: the web query log
  Model: P(q) = Σ_{i=1..k} P(S_i) + P(S_OLM)
◦ Find the P(T.Â_i) and P(OLM) that maximize the log-likelihood

  ℒ = Σ_{q ∈ Log} log( Σ_{i=1..k} P(S_i) + P(S_OLM) )

  subject to Σ_i P(T.Â_i) + P(OLM) = 1

EM algorithm
Initial step: select initial values for P(T.Â_i) and P(OLM)
Repeat
◦ Expectation step: based on the current estimates, compute for each query the probability that it was generated by each annotation S_i (or by the OLM)
◦ Maximization step: based on these expected counts, update P(T.Â_i) and P(OLM) by renormalizing

Computing P(AT_i | T.A)

  P(AT_i | T.A) = |T(AT.V)| / |T|

◦ The fraction of the entries in table T that take the values AT.V
◦ Example: Q = "50 inch LG lcd", S = (TVs, {(LG, TVs.Brand), (50 inch, TVs.Diagonal)}, {lcd}); here T.A = {Brand, Diagonal} and T(AT.V) = all records in TVs with brand LG and diagonal size 50 inch
◦ [offline] A mapping from attribute values to the number of matching records in the table is precomputed

Background: maximum likelihood estimation
◦ Given a set of observed data {x_1, …} and a proposed model P(x | θ), find the parameter θ that maximizes the likelihood
◦ Intuition (source: Wikipedia): instead of asking "If I were to flip a fair coin 100 times, what is the probability of it landing heads-up every time?", we ask "Given that I have flipped a coin 100 times and it has landed heads-up every time, what is the likelihood that the coin is fair?"
◦ In our setting: the observed data is the web query log, the model is P(q) = Σ_{i=1..k} P(S_i) + P(S_OLM), and we look for the P(T.Â_i) and P(OLM) that maximize ℒ
◦ The EM algorithm can be used to maximize ℒ

Background: the EM algorithm
An iterative algorithm with two steps:
◦ Expectation step: calculate the expected value Q of the log-likelihood function under the current parameter estimate θ^t
◦ Maximization step: find the parameter θ^{t+1} that maximizes Q
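A minimal sketch of such an EM loop for the mixture weights, assuming the per-query token likelihoods P(tokens of q | S_i) have been precomputed and are held fixed. The likelihood matrix in the test is invented toy data, and the function name is a stand-in, not from the paper:

```python
# EM for the mixture weights: pi[i] plays the role of P(T.A-hat_i),
# with the last component standing in for P(OLM).

def em_mixture_weights(likelihoods, n_iters=50):
    """likelihoods[q][i] = P(tokens of query q | component i).
    Returns mixture weights pi with sum(pi) == 1."""
    n_components = len(likelihoods[0])
    pi = [1.0 / n_components] * n_components        # uniform initialisation
    for _ in range(n_iters):
        counts = [0.0] * n_components
        for lq in likelihoods:
            # E-step: responsibility of each component for this query
            weighted = [p * l for p, l in zip(pi, lq)]
            total = sum(weighted)
            for i, w in enumerate(weighted):
                counts[i] += w / total
        # M-step: renormalise expected counts into new mixture weights
        pi = [c / len(likelihoods) for c in counts]
    return pi
```

Queries whose tokens are likely under a table's attributes pull prior mass toward that table; queries that only the open language model explains well pull mass toward P(OLM).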