ThuFit

Report
第一届中国大数据技术创新与创业大赛
关键词行业分类
ThuFit 队: 周昕宇,吴育昕 ,任杰 ,王禺淇 ,罗鸿胤
指导:方展鹏 ,唐杰
清华大学 未来互联网兴趣团队
Task

Given:




Partially labeled keywords
First 10 search results for each keywords
Keyword-buyer relationship
Goal:

Predict unlabeled keywords
Data summary

keyword_class.txt





11%
89%
keyword_users.txt



10,787,584 keywords
1,143,928 labeled, 10.6%
9,963,062 unique keywords
33 classes
keyword distribution
23,942,643 entries
Each entry is a keyword-buyer pair
labeled
unlabeled
keyword_titles.txt


21,575,166 entries, but only 10,787,583 entries are non-empty.
Each entry comprised of keyword and its first 10 search result
using Baidu
Approach

Preprocessing:


Feature Extraction:





Keyword segmentation
Keyword segment
Keyword-buyer relation
Keyword-segment relation
Search result utilization
Model:

liblinear
Keyword segmentation


Keyword 
Segement



Segmentation


A sub-string of a keyword
Semantic unit
Break a keyword to a set of segment
Two ways:

Exact segmentation

清华大学 => 清华/大学
  = [0 , 1 , 2 , … , −1 ]
Full segmentaion



清华大学 => 清华/大学/华大/清华大学
  = {0 , 1 , 2 , … , −1 }
 结巴中文分词:https://github.com/fxsjy/jieba

Feature Extraction - segment

Sparse representation of segments



Smoothened TFIDF-based feature
N-gram
“End-gram”
Feature Extraction - TFIDF
Just in this page: segment s = term 
  = {0 , 1 , 2 , … , −1 }
  ,  = |{ ′ |  ′ ∈   }|
 () = |  ∈  
∈
  ′ ,  = 1 + log (, )

||
= log
1+()
,  =  ′ , 

 t



Definition of () will be given later
()
Feature Extraction - N-gram

N-gram


To capture some structure information
Recall




2-gram




2  =  . +1  ∈ 0,1, . . ,  
 = 2   ∈ }
′ =  
s ∈   }
Limitation



There are two ways of segmenting a keyword
  = {0 , 1 , 2 , … , −1 }, a set
  = [0 , 1 , 2 , … , −1 ], an ordered list <- adopt this one
Large character set produce large keyword set
Noise
Reduced 2-gram


′ =   ∈ 
 ′ =  ′
   ℎ 5 }
Feature Extraction - End-gram

End-gram





  = 0 , 1 , 2 , … , −1
−1 is more likely to carry discriminative information
Emphasis on the last segment: append a character that did not appear in ,
e.g “漢”
Example

 = rnu209e.tvp2轴承

  = ["rnu209e", "2", "轴承"]

  = "轴承漢"

 ′′ =  ′ {()}

w =“hj系列双锥混合机市场调查报告”
Similarly we can define ()
Feature Extraction
 
 =
  2 
 
 Where is ()?
 Experiments showed that, when adding
(), performance slightly
degrades.
Keyword-buyer/segment relation
B0
K0
S0
K0
C0
B1
K1
S1
K1
C1
B2
K2
S2
K2
C2
B3
K3
S3
K3
C3
Keyword-buyer/segment relation
B0
K0
S0
K0
C0
B1
K1
S1
K1
C1
B2
K2
S2
K2
C2
B3
K3
S3
K3
C3
K0: C2
S0: C2
B0: C2 C3
K1:
S1: C3
B1:
K2:
S2:
B2:
K3: C3
S3: C2 C3
B3:
Keyword-buyer/segment relation
B0
K0
S0
K0
C0
B1
K1
S1
K1
C1
B2
K2
S2
K2
C2
B3
K3
S3
K3
C3
K0: C2 C0
S0: C2
B0: C2 C3
K1:
S1: C3 C3
B1: C0 C3
K2: C3
S2: C0
B2:
K3: C3
S3: C2 C3 C0 C3
B3:
Keyword-buyer relation
 Assumption: A user tends to by similar class of
keywords
 Obtain the distribution of classes of keywords a
buyer buys on labeled data.
 Each buyer has a 33-dimensioned feature vector
 For each keyword , its feature vectors is an
average over feature vector of a buyers that buys
this keyword.
 Using only this feature we get an accuracy of 0.82
Keyword-buyer relation
B0
K0
S0
K0
C0
B1
K1
S1
K1
C1
B2
K2
S2
K2
C2
B3
K3
S3
K3
C3
Keyword-buyer relation
 We have made effort trying modeling buyers by
the segments of keywords they bought, and
model keywords-keywords relationship by
exploiting their common connection with
segments.

Buyer -> Keyword ->Segment =>Buyer -> Segment
 We further introduced higher order relation
influence between buyers and keywords, but
improvements are subtle.
Keyword-segment relation
 Reverse the link between segment and keywords

Keyword ->Segment => Segment -> Keyword
Keyword-segment relation
B0
K0
S0
K0
C0
B1
K1
S1
K1
C1
B2
K2
S2
K2
C2
B3
K3
S3
K3
C3
Search Result Utilization

Some weird keywords appears /^[0-9a-zA-Z\-_]{1,}$/



1-1828169-5: 1 1828169 5
1-1838143-0: 1 1838143 0
Their search results

1-1838143-0 1-1838143-0全国供货商【IC37旗下站】1-18381430价格|PDF ... IC芯片1-1838143-0品牌、价格、PDF参数 - 电子产
品资料 - 买卖IC网 PIC16C57-XT/SP145的IC、二极管、三极管查
询,采购PIC16C57-XT/SP... 原装进口连接器 TYCO 1-1838143-0
2000pcs 1005+ 现货 泰科Tyco431829-1集成电路、连接器、接插件
AMP欧式背板连接器崧晔达_达价格_优质崧晔达批发/采购 - 阿里巴
巴
供应聚氯乙烯_连接器_供应聚 崧晔达价格_优质崧晔达批发/采
购 - 阿里巴巴
供应聚氯乙烯_连接器_供应聚氯乙烯批发_供应聚
氯乙烯供应_阿里巴巴 上海金庆电子技术有限公司
限位开关12
福州福铭仪器
Search Result Utilization
 For normal keywords, the keyword itself has
semantic meaning.
 For those keywords with less semantic
information, they are usually a product serial
number or some domain specific terminology ,
e.g chemical element names.
 These supplementary information yields more
accuracy results on “weird” keywords.
 But these keywords did not seem to be included
in online test.
Search Result Utilization

Recall:




  =  
2 
 
If we add one more term:
 ′  =
  2    (())
 where () is the search result of 
Performance decreased by noise introduced
Example


 = “hj系列双锥混合机市场调查报告”
  =“混合设备 HJ系列双锥混合机 - 常州市华欧干燥制粒设备有限公司 ...混合机-供应HJ系列双锥混合机-混合机尽在阿里巴巴-常州欧朋干燥... HJ
系列双锥混合机厂家_价格-食品机械行业网
HJ系列双锥混合机供应信息,常州市步群干燥设备有限公司 HJ系列双锥
混合机_百度百科
HJ系列双锥混合机 - 常州普耐尔干燥设备有限公
司
HJ系列双锥混合机价格(江苏 常州)-盖德化工网...”
Feature Statistics
 Dimensionality: 200,000
 Lower dimensionality introduce better
generalization ability.
Implementation
Life is short, you need Python
Model
 Liblinear: http://www.csie.ntu.edu.tw/~cjlin/liblinear/
 A Library for Large Linear Classification
 L2-loss logistic regression
 33 one-vs-all classifiers for each class.
Experiments and Results
 We split labeled data into training and validation
set
 All following results are local results.
 Online test result are higher due to utilizing more
training data.
 Due to the complexity of migrating our code to
hadoop platform (mainly because we used third
party non-java libraries), not all of the features
above are employed in our final submission.
Experiments and Results
Feature vector
constituents
Accuracy
Keyword-buyer relation
0.8194
Keyword-segment relation
0.9019
Keyword-buyer +
(() + TFIDF)
0.9537
() + TFIDF
0.9656
  + 2  +
TFIDF
0.9635
  + 2  +
() + TFIDF
0.9725
  + 2  +
  +
() + TFIDF
0.9713
Analysis
 We split labeled data into training and validation
set
 All following results are local results.
 Online test result are higher due to utilizing more
training data.
 Due to the complexity of migrating our code to
hadoop platform (mainly because we used third
party non-java libraries), not all of the features
above are employed in our final submission.
Limitations
 Two types of feature
 Relation feature:




Utilized prior knowledge of class label information
Low dimension
May biased to training data
TFIDF feature:



No class label information utilized
High dimension
Robust, good generalization ability
But a simple combination of two does not work well
 Ensemble methods may workaround this problem.

Thanks!

similar documents