
Latent Topic Feedback for Information Retrieval

David Andrzejewski, David Buttler


Juan Gabriel Romero
Universidad Nacional de Colombia
May 31, 2013
Juan Gabriel Romero (Universidad Nacional de Colombia) Latent Topic Feedback for Information Retrieval May 31, 2013 1 / 14
The Problem
Corpus:
Document metadata limited
Specialized domain
Large corpus, small user base
The user cannot formulate the right query
Solution
Obtaining user feedback at the latent topic level
Learn latent (unobserved) topics
Construct representations of these topics
Present potentially relevant topics to the user
Augment the original query
Latent Dirichlet Allocation
Figure 1: Blei, D. Sep 2009. Topic Models
Latent Dirichlet Allocation
Figure 2: Blei, D. Sep 2009. Topic Models
Latent Dirichlet Allocation
P(\mathbf{w}, \mathbf{z}, \phi, \theta \mid \alpha, \beta, d) \propto \prod_t p(\phi_t \mid \beta) \prod_j p(\theta_j \mid \alpha) \prod_i \phi_{z_i}(w_i) \, \theta_{d_i}(z_i)

To infer z, \phi, and \theta, run Markov Chain Monte Carlo (Gibbs sampling) and estimate

\phi_t(w) \propto n_{tw} + \beta

\theta_j(t) \propto n_{jt} + \alpha
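The count-based point estimates above can be sketched directly from the Gibbs sampling count matrices (a minimal illustration with numpy; variable names follow the slide, not the paper's actual implementation):

```python
import numpy as np

def estimate_phi_theta(n_tw, n_jt, beta, alpha):
    """Point estimates of the topic-word (phi) and document-topic (theta)
    distributions from Gibbs sampling counts.

    n_tw : (T, W) array, count of word w assigned to topic t
    n_jt : (D, T) array, count of topic t assigned in document j
    """
    # Smooth counts with the Dirichlet hyperparameters, then normalize rows.
    phi = (n_tw + beta) / (n_tw + beta).sum(axis=1, keepdims=True)
    theta = (n_jt + alpha) / (n_jt + alpha).sum(axis=1, keepdims=True)
    return phi, theta
```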
Topic representation
First, with k = 10, take the top-k words of each topic:

W_t = k\text{-}\mathrm{argmax}_w \, \phi_t(w)
Label generation (best topic word)

Description        Score
Word probability   f_1(w) = P(w \mid z = t)
Topic posterior    f_2(w) = P(z = t \mid w)
PMI                f_3(w) = \sum_{w' \in W_t \setminus w} \mathrm{PMI}(w, w')
Conditional 1      f_4(w) = \sum_{w' \in W_t \setminus w} P(w \mid w')
Conditional 2      f_5(w) = \sum_{w' \in W_t \setminus w} P(w' \mid w)
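As a sketch of the first two scores, f_1 reads the topic-word probability directly and f_2 inverts it with Bayes' rule; assuming a uniform topic prior when none is given (an assumption of this illustration, not stated on the slide):

```python
import numpy as np

def label_scores(phi, t, w, p_z=None):
    """Word-probability and topic-posterior label scores for word w in topic t.

    phi : (T, W) topic-word probability matrix, rows sum to 1
    p_z : (T,) topic prior; uniform if not given (illustrative assumption)
    """
    T = phi.shape[0]
    if p_z is None:
        p_z = np.full(T, 1.0 / T)
    f1 = phi[t, w]                                # P(w | z = t)
    f2 = phi[t, w] * p_z[t] / (phi[:, w] @ p_z)   # P(z = t | w) via Bayes
    return f1, f2
```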
Topic representation
n-gram identification (Turbo Topics)
Most significant trigram
Two most significant bigrams
Four most significant unigrams
Capitalization (for display)
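Assuming candidate n-grams have already been scored for significance (the actual Turbo Topics test is not reproduced here), the selection and capitalization step above could be sketched as:

```python
def topic_label(ngram_scores):
    """Assemble a topic representation from significance-scored n-grams.

    ngram_scores : list of (ngram_tuple, score) pairs.
    Selection counts per n-gram length follow the slide:
    1 trigram, 2 bigrams, 4 unigrams, capitalized for display.
    """
    take = {3: 1, 2: 2, 1: 4}
    parts = []
    for n, count in take.items():
        ranked = sorted((g for g in ngram_scores if len(g[0]) == n),
                        key=lambda g: g[1], reverse=True)[:count]
        parts += [" ".join(w.capitalize() for w in g[0]) for g in ranked]
    return parts
```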
Topic selection
Top 2 ranked documents considered relevant (the set D_q)
Enriched topics:
E = \bigcup_{d \in D_q} k\text{-}\mathrm{argmax}_t \, \theta_d(t)
Related topics:
R = \bigcup_{t \in E} k\text{-}\mathrm{argmax}_{t' \notin E} \, \mathrm{sim}(t, t')
Filter topics by average pairwise PMI:
\mathrm{PMI}(t) = \frac{1}{k(k-1)} \sum_{w \neq w' \in W_t} \mathrm{PMI}(w, w')
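The enriched-topic set E can be sketched as a union of top-k topics over the pseudo-relevant documents (a minimal numpy illustration; names are this sketch's, not the paper's):

```python
import numpy as np

def enriched_topics(theta, relevant_docs, k):
    """Union of the top-k topics (by document-topic weight theta)
    over the pseudo-relevant documents D_q.

    theta : (D, T) document-topic matrix
    relevant_docs : indices of the top-ranked documents
    """
    E = set()
    for d in relevant_docs:
        # Indices of the k largest entries of theta[d].
        E.update(np.argsort(theta[d])[::-1][:k].tolist())
    return E
```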
Query expansion
Add the 10 most probable words of the topic, W_t, to the query,
with \lambda \in [0, 1] as weight parameter.
For N_q words in the original query, the weight of each original query word is
\frac{1 - \lambda}{N_q}
The weight for each word w from the selected topic is then
\lambda \, \hat{\phi}_t(w),
with \hat{\phi}_t representing the re-normalized topic-word probability:
\hat{\phi}_t(w) = \frac{\phi_t(w)}{\sum_{w' \in W_t} \phi_t(w')}
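The interpolation above guarantees the expanded term weights sum to one; a minimal sketch (function and variable names are illustrative):

```python
def expansion_weights(query_terms, phi_t, W_t, lam):
    """Interpolated term weights for the expanded query.

    Original query terms share (1 - lam) uniformly; each topic word
    w in W_t gets lam times the re-normalized topic probability.
    phi_t : dict mapping word -> topic-word probability phi_t(w)
    """
    Nq = len(query_terms)
    weights = {w: (1.0 - lam) / Nq for w in query_terms}
    norm = sum(phi_t[w] for w in W_t)  # re-normalization over W_t
    for w in W_t:
        # A topic word already in the query accumulates both weights.
        weights[w] = weights.get(w, 0.0) + lam * phi_t[w] / norm
    return weights
```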
Experiments
Questions:
Can query expansion with latent topic feedback improve the results of
actual queries?
Assuming there are latent topics, will the topic selection described
present them to the user?
If presented with a helpful topic will a user actually select it?
(Outside the scope)
Experimental setup
Data set from TREC
MALLET
Preparation: downcasing; removal of numbers and punctuation marks;
stop-word removal; filtering of rarely occurring words
Vocabularies between 10,000 and 20,000 words
Gibbs inference run for 1,000 iterations, re-estimating every 25 samples
500 topics
\lambda = 0.25
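The preparation steps above can be sketched as a small pipeline (MALLET performs this internally; the regex tokenizer and the min_count threshold here are illustrative assumptions):

```python
import re
from collections import Counter

def prepare(docs, stopwords, min_count=5):
    """Downcase, keep alphabetic tokens only (drops numbers and
    punctuation), remove stop words, and filter words occurring
    fewer than min_count times in the corpus."""
    token = re.compile(r"[a-z]+")
    tokenized = [token.findall(d.lower()) for d in docs]
    counts = Counter(w for doc in tokenized for w in doc)
    return [[w for w in doc if w not in stopwords and counts[w] >= min_count]
            for doc in tokenized]
```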
Results
Metrics: Mean Average Precision (MAP), Normalized Discounted Cumulative
Gain (NDCG), NDCG@15
Results
For 40% of queries there exists a latent topic that can enhance results
For 40% of these queries the approach finds relevant topics
Changes to the technique give worse results:
Without filtering: increase in the number of topics presented without a
substantial increase in helpful topics retrieved
Excluding related topics: decrease in both the number of topics and the
helpful topics presented
