IR Project Report
Improved TF-IDF Ranker
Presentation by Muralidhar Chouhan
Contents
• Introduction
• Outline of our approach
• Background
o TF-IDF ranker
o Semantic similarity between sentences
• Details of our approach
• Results
• Conclusion
• References
Introduction
• Traditional information retrieval systems are particularly susceptible to
the problems posed by the richness of natural language.
• In particular, the same concept can be described in a multitude of ways.
• The overall context of the user input and of the document is ignored.
• The traditional TF-IDF ranker ignores the relatedness of concepts and
searches only for exact word matches.
• Introducing a semantic analyzer is expected to improve performance.
Introduction (cont..)
• The aim of the project is to combine the traditional TF-IDF ranker with a
semantic analyzer to retrieve documents, and to compare the performance
of the new system with that of the traditional TF-IDF ranker.
Introduction (cont..)
• This project uses:
o The Text Retrieval Conference (TREC) data set named Confusion track, for
validation [6]
o The WordNet lexical database
o The .NET framework (WordNet.Net)
Outline of our approach
[Figure: Traditional TF-IDF ranker. Flowchart boxes: Input Query, Documents, Pre-processor, Primary filter, Documents, TF-IDF Ranker, (Doc ID, Weight) pairs, Final Docs.]
Outline of our approach (cont..)
[Figure: TF-IDF ranker with introduction of semantic knowledge. Flowchart boxes: Input Query, Documents, Pre-processor, Primary filter, Documents, TF-IDF Ranker, (Doc ID, Weight) pairs, Semantic similarity, Final Docs.]
Outline of our approach (cont..)
[Figure: Semantic stage. Flowchart boxes: Docs obtained from the traditional TF-IDF approach, Input Query, Documents, Pre-processor, TF-IDF Ranker II, (Doc ID, Keywords), Corpus of (Word, DF) pairs, WordNet semantic analyzer, (Doc ID, Semantic score), Final Docs. Annotations: find the keywords from each doc; use TF and DF (use the corpus).]
Outline of our approach (cont..)
Pre-processor
• Tokenize
• Remove stop words
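The two pre-processor steps above can be sketched as follows. This is a minimal illustration, not the project's .NET code: the function name `preprocess`, the regular expression, and the short stop-word list are all assumptions made for the example.

```python
import re

# Illustrative stop-word list (an assumption; the project's actual list is not given).
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "is", "to", "for", "on"}

def preprocess(text):
    """Lowercase the text, split it into alphanumeric tokens, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

print(preprocess("The ranking of documents in a collection"))
# ['ranking', 'documents', 'collection']
```

The same function would be applied to the query and to each document so that both sides are compared on the same tokens.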
Background
TF-IDF ranker:
• TF-IDF is used as a weighting factor in information retrieval and
text mining.
• Terms that appear often in a document should get high weights.
• The more often a document contains a term, the more likely it is that the
document is about that term. This is captured by term frequency (TF).
• Terms that appear in many documents should get a low weight,
which is captured by inverse document frequency (IDF).
• The weight of term i in document j is calculated using the formula
below [5]:
Wi,j = TFi,j * log(N / DFi)
where TFi,j is the frequency of term i in document j, DFi is the number of
documents that contain term i, and N is the total number of documents.
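As a small worked sketch of the weight formula, the function below computes the weight for a single term and document. The name `tfidf_weight` is hypothetical, and the natural logarithm is an assumption, since the slides do not state the base of the log.

```python
import math

def tfidf_weight(tf, df, n_docs):
    """Weight = TF * log(N / DF). Natural log is assumed here.

    Returns 0.0 when the term occurs in no document (DF = 0),
    which avoids a division by zero.
    """
    if df == 0:
        return 0.0
    return tf * math.log(n_docs / df)

# A term that appears in every document carries no weight: log(N / N) = 0.
print(tfidf_weight(tf=3, df=10, n_docs=10))  # 0.0
```

A rarer term (small DF) gets a larger logarithm and therefore a larger weight for the same term frequency.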
Semantic similarity between sentences:
• Semantic similarity between sentences is calculated using semantic
information and word order information.
• This project uses an implementation that calculates the
semantic relatedness between two sets of strings.
• The implementation uses the WordNet lexical database to calculate the
semantic relatedness.
• The score lies between 0 and 1, where 0 represents the least similarity
and 1 the highest.
WordNet:
• WordNet is the product of a research project at Princeton University
[4].
• Information in WordNet is organized around logical groupings called
synsets.
• Each synset consists of a list of synonymous word forms and semantic
pointers that describe relationships between the current synset and
other synsets.
• In WordNet, the words of each part of speech (nouns, verbs, ...) are
organized into taxonomies in which each node is a set of synonyms
(synset) representing one sense.
WordNet (cont..)
• If a word has more than one sense, it will appear in multiple synsets at
various locations in the taxonomy.
• WordNet defines relations between synsets and relations between
word senses. A relation between synsets is a semantic relation, and a
relation between word senses is a lexical relation.
WordNet (cont..)
• For example, the shortest path between male and female in Fig. 1 is
male-person-female, so the minimum path length is 2.
• The minimum path length between female and teacher is 5.
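The path lengths in this example can be reproduced with a breadth-first search over a small taxonomy. Fig. 1 is not reproduced here, so the edge list below is a hypothetical fragment chosen only to be consistent with the two path lengths quoted above; it is not taken from WordNet itself.

```python
from collections import deque

# Hypothetical is-a fragment (an assumption standing in for Fig. 1).
EDGES = [
    ("person", "male"), ("person", "female"), ("person", "adult"),
    ("adult", "professional"), ("professional", "educator"),
    ("educator", "teacher"),
]

def build_graph(edges):
    """Build an undirected adjacency map from (parent, child) pairs."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph

def shortest_path_length(graph, start, goal):
    """Breadth-first search; returns the number of edges, or None if unreachable."""
    if start not in graph or goal not in graph:
        return None
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None

GRAPH = build_graph(EDGES)
print(shortest_path_length(GRAPH, "male", "female"))     # 2
print(shortest_path_length(GRAPH, "female", "teacher"))  # 5
```

A shorter path means the two concepts are closer in the taxonomy, which is the intuition behind path-based similarity measures.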
Details of our approach
Traditional TF-IDF Ranker
Step 1: Preprocess the input query
o Tokenization
o Remove stop words
Step 2: Apply the TF-IDF ranker
• The TF-IDF ranker identifies the number of times each word appears in
each of the documents, as shown below.

        D1      D2      D3      ..      DN      DF
W1      TF11    TF12    ..      ..      TF1N    DF1
W2      TF21    TF22    ..      ..      TF2N    DF2
W3      TF31    TF32    ..      ..      TF3N    DF3
:       :       :                       :       :
Wn      TFn1    TFn2    ..      ..      TFnN    DFn

• TFij is the term frequency of word Wi in document Dj.
• DFi is the document frequency of word Wi in the document collection.
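The TF and DF counts in the table can be computed in a few lines. This is a sketch under assumptions: documents are already tokenized, and the function name `term_document_counts` and the toy documents are made up for illustration.

```python
from collections import Counter

def term_document_counts(docs):
    """docs maps a document id to its list of tokens.

    Returns (tf, df): tf[doc_id][word] is the term frequency of the word in
    that document, and df[word] is the number of documents containing it.
    """
    tf = {doc_id: Counter(tokens) for doc_id, tokens in docs.items()}
    df = Counter()
    for counts in tf.values():
        df.update(counts.keys())  # each document counts once per distinct word
    return tf, df

docs = {
    "D1": ["cat", "sat", "cat"],
    "D2": ["dog", "sat"],
    "D3": ["cat", "dog", "dog"],
}
tf, df = term_document_counts(docs)
print(tf["D1"]["cat"], df["cat"])  # 2 2
```

Because `Counter` returns 0 for missing keys, `tf["D2"]["cat"]` is 0 rather than an error, which matches an empty cell in the table.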
Details of our approach(cont..)
Calculating the weight:
• The weight of each word is calculated using the formula below.
Wi,j = TFi,j * log(N / DFi)

              D1      D2      D3      ..      DN      DF
W1            W11     W12     ..      ..      W1N     DF1
W2            W21     W22     ..      ..      W2N     DF2
W3            W31     W32     ..      ..      W3N     DF3
:             :       :                       :       :
Wn            Wn1     Wn2     ..      ..      WnN     DFn
Weight sum    S1      S2      ..      ..      SN
Details of our approach(cont)
Step 3: Retrieve the documents
Sort the documents by weight and pick the top Q documents for further
processing. Q is chosen such that the weight of each selected document
exceeds a particular threshold d1.
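A minimal sketch of the scoring and retrieval step follows. It assumes that the rows of the weight table are the pre-processed query words, that a document's score is the sum of its word weights, and that "crosses the threshold" means a score of at least d1. The function names and the toy documents are hypothetical.

```python
import math
from collections import Counter

def score_documents(query_terms, docs):
    """Sum TF * log(N / DF) over the query terms for every document."""
    n_docs = len(docs)
    tf = {doc_id: Counter(tokens) for doc_id, tokens in docs.items()}
    df = Counter()
    for counts in tf.values():
        df.update(counts.keys())
    scores = {}
    for doc_id, counts in tf.items():
        scores[doc_id] = sum(
            counts[t] * math.log(n_docs / df[t])
            for t in query_terms
            if df[t] > 0  # query terms absent from the collection add nothing
        )
    return scores

def retrieve(scores, threshold):
    """Sort by descending score and keep documents at or above the threshold."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [(doc_id, s) for doc_id, s in ranked if s >= threshold]

docs = {
    "D1": ["semantic", "search", "search"],
    "D2": ["semantic", "web"],
    "D3": ["cooking", "recipes"],
}
scores = score_documents(["semantic", "search"], docs)
print([doc_id for doc_id, _ in retrieve(scores, threshold=1.0)])  # ['D1']
```

Lowering the threshold admits more documents, which is how the improved ranker's looser threshold d2 yields a larger candidate set.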
Improved TF-IDF Ranker
Step 1: Choose the top S documents from Step 3 of the previous method, using
a second, lower threshold d2 (d2 < d1) to get the set of documents for
further processing.
Step 2: Extract the keywords (words that have high TF and low DF) from
each document.

Doc     TF      DF      Weight
W1      TF1     DF1     We1
W2      TF2     DF2     We2
W3      TF3     DF3     We3
:       :       :       :
Wn      TFn     DFn     Wen
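Keyword extraction for one document could look like the sketch below. It assumes "high TF and low DF" is scored as TF * log(N / DF) and that the top k words are kept. The cut-off `top_k`, the function name, and the toy DF values are illustrative assumptions, not values from the project.

```python
import math
from collections import Counter

def extract_keywords(doc_tokens, df, n_docs, top_k=5):
    """Rank the words of one document by TF * log(N / DF) and keep the top k.

    df maps each word to its document frequency in the corpus.
    Ties are broken alphabetically so the result is deterministic.
    """
    tf = Counter(doc_tokens)
    weights = {
        word: count * math.log(n_docs / df[word])
        for word, count in tf.items()
        if df.get(word, 0) > 0
    }
    ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:top_k]]

# Toy corpus statistics (assumed): 10 documents in total.
df = {"retrieval": 2, "the": 10, "semantic": 1, "model": 5}
tokens = ["semantic", "retrieval", "retrieval", "the", "the", "the", "model"]
print(extract_keywords(tokens, df, n_docs=10, top_k=2))
# ['retrieval', 'semantic']
```

Note that "the" is the most frequent token but appears in every document, so its weight is zero and it is never chosen as a keyword.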
Details of our approach(cont)
[Figure: Corpus containing the IDF, log(N/DF), of each word from the documents.]
Details of our approach(cont..)
Step 3: For each document, calculate the semantic similarity score
between its keyword set and the input query.
Step 4: Sort the documents by score and eliminate those with a score below
a specified threshold (b = 0.5).
Step 5: Display the documents.
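Steps 3 to 5 can be sketched with a pluggable similarity function. The project uses a WordNet-based score; the Jaccard token overlap below is only a stand-in with the same 0 to 1 range, used so that the example runs without WordNet. All names here are hypothetical.

```python
def jaccard_similarity(a, b):
    """Stand-in for the WordNet-based score: token overlap in [0, 1]."""
    set_a, set_b = set(a), set(b)
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)

def semantic_filter(query_terms, doc_keywords, similarity=jaccard_similarity, b=0.5):
    """Score each document's keyword set against the query, drop scores
    below b, and return (doc_id, score) pairs sorted by descending score."""
    scored = [
        (doc_id, similarity(query_terms, keywords))
        for doc_id, keywords in doc_keywords.items()
    ]
    kept = [(doc_id, s) for doc_id, s in scored if s >= b]
    return sorted(kept, key=lambda kv: kv[1], reverse=True)

doc_keywords = {
    "D1": ["semantic", "search", "ranking"],
    "D2": ["semantic", "web", "ontology"],
    "D3": ["search", "semantic"],
}
result = semantic_filter(["semantic", "search"], doc_keywords)
print([doc_id for doc_id, _ in result])  # ['D3', 'D1']
```

Passing a WordNet-backed function as `similarity` would leave the filtering and sorting logic unchanged.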
Results
Confusion Track result set
Results(cont..)
Results: Old system vs New system
Results(cont..)
Calculating precision & recall for 10 queries
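For reference, precision and recall for a single query can be computed as below. The document ids are made up for illustration and are not results from the project.

```python
def precision_recall(retrieved, relevant):
    """Precision = hits / retrieved, recall = hits / relevant.

    Both are defined as 0.0 when their denominator is empty.
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Illustrative ids only: 2 of the 4 retrieved documents are relevant,
# and 2 of the 3 relevant documents were retrieved.
p, r = precision_recall(["D1", "D2", "D3", "D4"], ["D1", "D3", "D7"])
print(p)  # 0.5
```

Running this once per query for both systems gives the per-query values that a precision and recall chart compares.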
Results(cont..)
[Figure: Precision and recall bar chart, old system vs new system. Series: TF-IDF (P), TF-IDF (R), Semantic (P), Semantic (R). X-axis: queries 1 to 10. Y-axis: 0 to 1.2.]
Screenshots
Traditional TF-IDF Ranker
Screenshots(cont..)
Improved TF-IDF Ranker (with semantic knowledge)
Conclusion
• This project improved the traditional TF-IDF ranker by introducing a
semantic analyzer.
• It showed that using the semantic analyzer yields good
precision and recall values.
• It used a data set from the Text Retrieval Conference (TREC) to
validate the project.
• One limitation of the TF-IDF ranker is that terms that occur in the query
but cannot be found in the documents get zero scores.
References
[1] R. Rada, H. Mili, E. Bicknell, and M. Blettner, “Development and
Application of a Metric on Semantic Nets,” IEEE Trans. Systems, Man, and
Cybernetics, vol. 19, no. 1, pp. 17-30, 1989.
[2] Y. Li et al., “Sentence Similarity Based on Semantic Nets and Corpus
Statistics,” IEEE Trans. Knowledge and Data Engineering, vol. 18, no. 8, 2006.
[3] T. Dao and T. Simpson, “Measuring similarity between the sentences.”
Web.
[4] R. Richardson, A. F. Smeaton, and J. Murphy, “Using WordNet as a
Knowledge Base for Measuring Semantic Similarity between Words,” School
of Computer Applications, Dublin City University. Web.
[5] TfIdf Ranker, ‘http://vetsky.narod2.ru/catalog/tfidf_ranker/’. Web.
[6] Confusion track, TREC dataset,
‘http://trec.nist.gov/data/t5_confusion.html’. Web.
Thank you 
