The Boolean model is a simple retrieval model based on set theory and Boolean algebra. Since the concept of
a set is quite intuitive, the Boolean model provides a framework which is easy to grasp by a common user of
an IR system.
Furthermore, the queries are specified as Boolean expressions which have precise
commercial bibliographic systems.
26 MODELING
Figure 2.3 The three conjunctive components for the query $[q = k_a \wedge (k_b \vee \neg k_c)]$.
these drawbacks, the Boolean model is still the dominant model with commercial
the field.
not, and, or. Thus, a query is essentially a conventional Boolean expression which
Definition For the Boolean model, the index term weight variables are all
is defined as
CLASSIC INFORMATION RETRIEVAL 27
If $sim(d_j, q) = 1$, the Boolean model predicts that the document $d_j$ is relevant
to the query q (it might not be). Otherwise, the prediction is that the document
is not relevant.
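The binary decision described above can be sketched in a few lines. This is a minimal illustration, not the book's implementation: it assumes the query of Figure 2.3 reads q = ka AND (kb OR NOT kc), treats each document as a plain set of index terms, and uses a hypothetical helper name `boolean_sim` and a made-up four-document collection.

```python
from typing import Dict, Set


def boolean_sim(doc_terms: Set[str]) -> int:
    """Return 1 if the document satisfies q = ka AND (kb OR NOT kc), else 0.

    The query is hard-coded for illustration; index term weights are binary,
    so a term either occurs in the document or it does not.
    """
    ka = "ka" in doc_terms
    kb = "kb" in doc_terms
    kc = "kc" in doc_terms
    return 1 if ka and (kb or not kc) else 0


# Hypothetical collection: each document is the set of index terms it contains.
docs: Dict[str, Set[str]] = {
    "d1": {"ka", "kb", "kc"},  # ka and kb present        -> relevant
    "d2": {"ka", "kc"},        # kb absent, kc present    -> not relevant
    "d3": {"ka"},              # kb absent, kc absent     -> relevant
    "d4": {"kb"},              # ka absent                -> not relevant
}

# The answer set is unranked: a document is either in it or not.
answer = sorted(name for name, terms in docs.items() if boolean_sim(terms) == 1)
print(answer)
```

Note that the result is a set, not a ranking: d2 matches one of the two conjuncts but is rejected outright, which is the lack of partial matching the text goes on to discuss.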
The Boolean model predicts that each document is either relevant or non-relevant. There is no notion of a partial
match to the query conditions. For
, dj )
is positive and non-binary. Further, the index terms in the query are also
, q], where $w_{i,q} \geq 0$.
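$$sim(d_j, q) = \frac{\vec{d_j} \cdot \vec{q}}{|\vec{d_j}| \times |\vec{q}|} = \frac{\sum_{i=1}^{t} w_{i,j} \times w_{i,q}}{\sqrt{\sum_{i=1}^{t} w_{i,j}^2} \times \sqrt{\sum_{i=1}^{t} w_{i,q}^2}}$$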
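As a rough illustration of the cosine formula, the following sketch computes the similarity between a document vector and a query vector given as equal-length lists of non-negative weights. The function name `cosine_sim` and the convention of returning 0.0 for an all-zero vector are assumptions made for this example, not part of the model's definition.

```python
import math
from typing import Sequence


def cosine_sim(d: Sequence[float], q: Sequence[float]) -> float:
    """Cosine of the angle between document vector d and query vector q."""
    # Numerator: inner product, the sum over i of w_ij * w_iq.
    dot = sum(w_d * w_q for w_d, w_q in zip(d, q))
    # Denominator: product of the two Euclidean norms.
    norm_d = math.sqrt(sum(w * w for w in d))
    norm_q = math.sqrt(sum(w * w for w in q))
    if norm_d == 0.0 or norm_q == 0.0:
        # Assumption: a vector with no weighted terms matches nothing.
        return 0.0
    return dot / (norm_d * norm_q)


# Vectors pointing in the same direction score 1 regardless of length,
# and vectors sharing no index term score 0.
print(cosine_sim([1.0, 2.0], [2.0, 4.0]))
print(cosine_sim([1.0, 0.0], [0.0, 1.0]))
```

Because all weights are non-negative, the value always falls between 0 and 1, so documents that match the query only partially still receive a graded score and can be ranked.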
, q).
Since $w_{i,j} \geq 0$ and $w_{i,q} \geq 0$, sim(q, $d_j$
basic principles which support clustering techniques, as follows.
for a set A of cars which have a price comparable to that of a Lexus 400. Since it
unique) description of the set A. More sophisticated clustering algorithms might
in the set A and which ones are not (i.e., the IR problem can be viewed as a
the collection C. The first set of features provides for quantification of intra-cluster similarity, while the second
set of features provides for quantification
. Such term frequency is
raw frequency of term $k_i$ in the document $d_j$ (i.e., the number of times the term
$$f_{i,j} = \frac{freq_{i,j}}{\max_l freq_{l,j}} \qquad (2.1)$$
of the document $d_j$. If the term $k_i$ does not appear in the document $d_j$ then
i, be given by
(2.2)
(2.3)
schemes.
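A small sketch may help make the term-weighting scheme concrete. It assumes the usual tf-idf form: the normalized term frequency of equation (2.1) multiplied by log(N/n_i), where N is the number of documents and n_i the number of documents containing term k_i. The natural logarithm, the function name `tf_idf_weights`, and the toy collection are choices made for this example.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf_weights(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Compute one {term: weight} mapping per tokenized document."""
    n_docs = len(docs)
    # n_i: number of documents in which each term appears at least once.
    doc_freq: Counter = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))

    weights = []
    for tokens in docs:
        freq = Counter(tokens)                      # raw frequencies freq_ij
        max_freq = max(freq.values()) if freq else 1
        weights.append({
            # Normalized tf (intra-cluster similarity) times idf
            # (inter-cluster dissimilarity).
            term: (count / max_freq) * math.log(n_docs / doc_freq[term])
            for term, count in freq.items()
        })
    return weights


# Toy collection (hypothetical): "a" occurs in every document, so its idf,
# and therefore its weight, is zero everywhere.
collection = [["a", "a", "b"], ["a", "c"], ["a"]]
for doc_weights in tf_idf_weights(collection):
    print(doc_weights)
```

A term that appears in every document gets weight 0, which reflects the intuition that it is useless for telling relevant documents from non-relevant ones.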
Several variations of the above expression for the weight $w_{i,j}$ are described in an
many collections.
$$w_{i,q} = \left(0.5 + \frac{0.5 \, freq_{i,q}}{\max_l freq_{l,q}}\right) \times \log \frac{N}{n_i} \qquad (2.4)$$
where $freq_{i,q}$ is the raw frequency of the term $k_i$ in the text of the information
request q.
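The query-side weights can be sketched in the same style. The example assumes equation (2.4) has the form w_iq = (0.5 + 0.5 freq_iq / max_l freq_lq) × log(N / n_i). The function name `query_weights`, the natural logarithm, and the decision to skip query terms that occur in no document are assumptions of this sketch.

```python
import math
from collections import Counter
from typing import Dict, List


def query_weights(query_tokens: List[str],
                  doc_freq: Dict[str, int],
                  n_docs: int) -> Dict[str, float]:
    """Weight each query term by smoothed tf times idf.

    doc_freq maps a term to n_i, the number of documents containing it.
    """
    freq = Counter(query_tokens)
    if not freq:
        return {}
    max_freq = max(freq.values())
    weights = {}
    for term, count in freq.items():
        n_i = doc_freq.get(term, 0)
        if n_i == 0:
            # Assumption: a term absent from the collection cannot match
            # any document, so it is left out instead of dividing by zero.
            continue
        # The 0.5 offset keeps every query term at a tf factor of at least
        # 0.5, so terms mentioned once are not dwarfed by repeated terms.
        tf = 0.5 + 0.5 * count / max_freq
        weights[term] = tf * math.log(n_docs / n_i)
    return weights


# Hypothetical statistics: 1000 documents, "car" in 10 of them, "price" in 100.
print(query_weights(["car", "car", "price"], {"car": 10, "price": 100}, 1000))
```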
The main advantages of the vector model are: (1) its term-weighting scheme
of documents that approximate the query conditions; and (3) its cosine ranking formula sorts the
documents according to their degree of similarity to the
hurt the overall performance.
documents. Thus, we can think of the querying process as a process of specifying the properties of an ideal
answer set (which is analogous to interpreting the
guessing what they could be. This initial guess allows us to generate a preliminary probabilistic description of
the ideal answer set which is used to retrieve
of the ideal answer set. By repeating this process many times, it is expected
that such a description will evolve and become closer to the real description of
the ideal answer set. Thus, one should always have in mind the need to guess at
the user will find the document $d_j$ interesting (i.e., relevant). The model assumes that this probability of
relevance depends on the query and the document
documents which the user prefers as the answer set for the query q. Such an
, as
an erroneous judgement [282, 785].
Definition For the probabilistic model, the index term weight variables are
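$$sim(d_j, q) = \frac{P(R \mid \vec{d_j})}{P(\bar{R} \mid \vec{d_j})}$$
$$sim(d_j, q) = \frac{P(\vec{d_j} \mid R) \times P(R)}{P(\vec{d_j} \mid \bar{R}) \times P(\bar{R})}$$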
the set R of relevant documents. Further, P(R) stands for the probability that a
we write,
$$sim(d_j, q) \sim \frac{P(\vec{d_j} \mid R)}{P(\vec{d_j} \mid \bar{R})}$$
index term $k_i$ is not present in a document randomly selected from the set R.
to the ones just described.
factors which are constant for all documents in the context of the same query,
In the very beginning (i.e., immediately after the query specification), there
$$P(k_i \mid R) = 0.5 \qquad P(k_i \mid \bar{R}) = \frac{n_i}{N}$$
index term $k_i$ and N is the total number of documents in the collection. Given
is improved as follows.
ki
these sets (it should always be clear when the used variable refers to the set or to
of the index term $k_i$ among the documents retrieved so far, and (b) we can
$$P(k_i \mid R) = \frac{V_i}{V} \qquad P(k_i \mid \bar{R}) = \frac{n_i - V_i}{N - V}$$
conceived.
$$P(k_i \mid R) = \frac{V_i + 0.5}{V + 1} \qquad P(k_i \mid \bar{R}) = \frac{n_i - V_i + 0.5}{N - V + 1}$$
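$$P(k_i \mid R) = \frac{V_i + \frac{n_i}{N}}{V + 1} \qquad P(k_i \mid \bar{R}) = \frac{n_i - V_i + \frac{n_i}{N}}{N - V + 1}$$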
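The estimation steps can be illustrated with a short sketch. It assumes the common starting guesses P(k_i|R) = 0.5 and P(k_i|R̄) = n_i/N, and the adjusted re-estimates (V_i + a)/(V + 1) and (n_i − V_i + a)/(N − V + 1), where V is the number of documents retrieved so far, V_i the number of those containing k_i, and a the adjustment factor (0.5 or n_i/N). The function names and the sample counts are hypothetical.

```python
from typing import Tuple


def initial_estimates(n_i: int, n_docs: int) -> Tuple[float, float]:
    """Starting guesses, before any document has been retrieved.

    Returns (P(k_i | R), P(k_i | not R)).
    """
    return 0.5, n_i / n_docs


def reestimate(v_i: int, v: int, n_i: int, n_docs: int,
               adjust: float = 0.5) -> Tuple[float, float]:
    """Re-estimate both probabilities from the top-ranked subset V.

    The adjustment factor keeps the estimates away from 0 and 1 when
    v or v_i is very small (for example v = 1 and v_i = 0).
    """
    p_rel = (v_i + adjust) / (v + 1)
    p_nonrel = (n_i - v_i + adjust) / (n_docs - v + 1)
    return p_rel, p_nonrel


# Hypothetical counts: 1000 documents, term in 50 of them,
# 10 documents retrieved so far, 3 of which contain the term.
n_i, n_docs, v_i, v = 50, 1000, 3, 10
print(initial_estimates(n_i, n_docs))
print(reestimate(v_i, v, n_i, n_docs))                        # adjustment 0.5
print(reestimate(v_i, v, n_i, n_docs, adjust=n_i / n_docs))   # adjustment n_i/N
```

Calling `reestimate` repeatedly, each time with counts taken from the newly ranked top documents, mirrors the iterative refinement the text describes; no human relevance judgements are needed for that loop.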
This completes our discussion of the probabilistic model.
The main advantage of the probabilistic model, in theory, is that documents are ranked in decreasing order of their
probability of being relevant. The
into relevant and non-relevant sets; (2) the fact that the method does not take
(i.e., all weights are binary); and (3) the adoption of the independence assumption for index terms. However,
as discussed for the vector model, it is not clear