Approved by:

Rada Mihalcea
(B.S., Technical University of Cluj-Napoca, 1997)
May 15, 1999
ACKNOWLEDGMENTS

I want first of all to thank my advisor, Dr. Dan I. Moldovan, for his continuous support and help during my graduate study, and for his valuable advice in writing this thesis. I also want to thank the faculty, staff and students from the Computer Science Department for their help and support.
Specifically, we describe a system that addresses two issues: (1) the translation of a natural language question or sentence into a query and query extension using WordNet and (2) extraction of paragraphs that render relevant information from the documents fetched by the search engines.
TABLE OF CONTENTS

ACKNOWLEDGMENTS

CHAPTER

1. INTRODUCTION
2. BACKGROUND ON RESOURCES
   2.1.1. WordNet
   2.1.2. SemCor
   2.2.2. TREC topics
3.4. An example
3.5. Evaluation and comparison with other methods
   3.5.1. Tests against SemCor
   3.5.2. Comparison with other methods
3.6. Extensions
LIST OF FIGURES
LIST OF TABLES

3.1. Statistics gathered from the Internet for 384 word pairs
3.2. Values used in computing the conceptual density Cij
3.3. Final results obtained for 384 word pairs using both algorithms.
3.4. CD for bomb- pairs
3.5. CD for cause- pairs
4.2. A sample of the results obtained for randomly selected questions from the TREC and the REAL sets.
4.3. Summary of results for the 100 test questions. In each box, the top number indicates the number of documents retrieved, and the bottom number indicates the relevant documents in top 10 ranking, and respectively the total number of relevant paragraphs.
4.4. Precision and resolution for searches performed with AltaVista for the 100 test questions.
5.1. Time measurements for the disambiguation method
To my family.
CHAPTER 1

INTRODUCTION

Word Sense Disambiguation (WSD) is an open problem in Natural Language Processing (NLP). There are several other NLP fields, like information retrieval, coherence, discourse, which can benefit from an accurate WSD method.

WSD methods can be broadly classified into three types:

1. WSD that makes use of the information provided by machine readable dictionaries: this is the case with the work reported by Cowie [12], Miller [31], Agirre [1], Li [23] and McRoy [26];

2. WSD that uses information gathered from training on a corpus that has already been semantically disambiguated (supervised training methods); such methods have been implemented by Gale [15], Ng [35];

3. WSD that uses information gathered from raw corpora (unsupervised training methods); Yarowsky [45] and Resnik [37] presented unsupervised WSD methods.
There are also hybrid methods that combine several sources of knowledge such as lexicon information, heuristics, collocations and others: McRoy [26], Bruce [7], Ng [35] and Rigau [39].
Statistical methods produce high accuracy results for a small number of preselected words. A lack of widely available semantically tagged corpora almost excludes supervised learning methods. A possible solution for the automatic acquisition of sense tagged corpora has been presented in [28], but the corpora acquired with this method have not yet been tested for statistical disambiguation of words. On the other hand, disambiguation using unsupervised methods has the disadvantage that the senses are not well defined. None of the statistical methods disambiguates adjectives or adverbs so far.
The WSD method proposed in this thesis attempts to disambiguate all the nouns, verbs, adjectives and adverbs in a text, using the senses provided in WordNet [13]. To our knowledge, there is only one other method, recently reported, that disambiguates unrestricted words in texts [42].

Our method is based on the idea of word-word dependencies. The senses of the words are determined by the context in which they appear. Given the sentences "can you spell the first letter" and "he wrote a letter to the editor", the word "letter" has different meanings: in the first example, its sense is that of "alphabetical letter", while in the second example it means "a written message". The correct sense can be selected if we consider the other words within the given context, and then determine the most likely combinations of word senses.
Our approach for WSD is to disambiguate the words based on their occurrence within word-word pairs. It is actually a hybrid method which combines two sources of information. In the first step, the Internet is used to gather statistics for word co-occurrences. The ranking of senses provided by this first step will be used as input to the second algorithm: the semantic density of a word pair is calculated using the information provided by WordNet hierarchies.
With this method, we actually rank the possible correct senses of the words. This can be particularly useful for tasks such as information retrieval. Consider, for example, the sentence He has to investigate all the reports. Even though the phrase may be ambiguous due to the small context, the results of a search based on such a sentence can still be improved if we consider the possible associations between the senses of the verb and the words related to the original keywords, which increases the number of possible relevant documents found by a search based on the input question.
A main problem with the current search engines is that many of the documents retrieved for general queries are totally irrelevant to the subject of interest, while relevant documents may be missing because the query does not contain the exact keywords. A more refined query, with more restrictive boolean operators, may result in few or even no documents.
This calls for a natural language interface that transforms sentences into queries with boolean operators currently accepted by the search engines. One thing that can be highly beneficial is to search not only for the words that occur in the input sentence, but to create similarity lists with words from on-line dictionaries that have the same meaning as the input words. This can significantly broaden the web search.
Another area of improvement is to design new retrieval operators that retain more relevant information. A possible approach along this direction is still to use the existent search engines, which were developed with great effort, but to post-process the set of documents produced for a query.
Figure 1.1 shows the way in which an Internet search system can be improved by making use of NLP techniques. The figure presents a comparison between two architectures of Internet search systems: the one on the left side of the figure represents the basic scheme currently used by the search engines; the one on the right side of the figure describes an Internet search system in which NLP techniques are involved. The NLP stages and resources are represented as dark-background rectangles.
Given a natural language question, the improved system will perform two tasks:

It determines the senses of the keywords in the input question. Using the information from WordNet, a similarity list is created for each keyword, which includes words semantically related to it. These lists are used for query extension, which is intended to increase the number of relevant documents found during a search.
(Figure 1.1 diagram: an English sentence passes through lexical processing (POS tagging, WSD) with WordNet; the resulting keywords are combined with boolean operators into a query; the query is sent to a search engine (AltaVista); the ranked documents are then post-processed with lexical operators to extract relevant paragraphs/sentences.)

Figure 1.1. A comparison between two Internet search system architectures: (left) the basic Internet search system currently used by search engines, and (right) an Internet search system improved with NLP techniques. The dark rectangles represent stages in the search system performed with NLP methods.
The extended query is then used as an input to the current search engines to retrieve documents from the Internet. These documents are further processed: new lexical operators are applied in order to fetch the relevant information. This step increases the precision of the search.
The thesis is organized as follows: Chapter 2 presents the resources used by our Word Sense Disambiguation method, as well as the resources involved in the Internet search system. The two kinds of resources, namely lexical resources and information retrieval resources, are presented in sections 2.1 and 2.2.
Next, the Word Sense Disambiguation method is described in Chapter 3: section 3.1 describes our approach for WSD; sections 3.2 and 3.3 present the two algorithms involved in this method; an example is presented in section 3.4, and then the results obtained are summarized in section 3.5. Possible extensions of this method are discussed in section 3.6.
An application of this WSD method is to improve the quality of the search on the Internet: Chapter 4 describes such a natural language interface for searching the Internet. We first describe the problems encountered while using the current search engines; a solution to these problems is to make use of natural language techniques: this improved search system is described in section 4.4. An example of the usage of this system and the results we obtained are presented in sections 4.5 and 4.6. Previous work in this field is summarized in section 4.2.
Finally, the conclusions regarding the work described in this thesis are presented in Chapter 5.
CHAPTER 2

BACKGROUND ON RESOURCES

Several resources have been used in developing and testing the system described in this thesis. We can categorize these into:

1. Lexical resources, including:
(a) WordNet, a Machine Readable Dictionary, including the glosses defined in this dictionary;
(b) SemCor, a semantically tagged corpus;
(c) Brill's part of speech tagger.

2. Information Retrieval resources, including:
(a) AltaVista, a search engine for the Internet;
(b) TREC, the Text Retrieval Conference, which provides researchers with a set of topics with the purpose of testing systems designed for information retrieval.
2.1. Lexical resources

The second step of the WSD method proposed here makes use of the information provided by WordNet glosses in order to distinguish between word senses. To test the accuracy achieved in disambiguating words, we used SemCor.

Within the search system (as presented in chapter 4) we need to process a natural language question in order to identify its keywords and the main concepts. To do that, we need to know the part of speech of the words in the input sentence; this part of speech tagging is done using a version of Brill's Part of Speech Tagger [6]. During the query expansion phase, we are also using WordNet.
2.1.1. WordNet

WordNet is a Machine Readable Dictionary developed at Princeton University by a group led by George Miller [30], [13]. It is used by our system for WSD and generation of similarity lists for query extension.

WordNet covers the vast majority of nouns, verbs, adjectives and adverbs from the English language. The words in WordNet are organized in synonym sets, called synsets. Figure 2.1 presents a snapshot from the WordNet semantic hierarchies.
(Figure 2.1 nodes include: carnivore; fissiped mammal, fissiped; canine, canid; procyonid; wolf; wild dog; dog; hyena, hyaena; hunting dog; dachshund, dachsie, badger dog; working dog; terrier; police dog; Syrian bear; grizzly.)
Each synset has an associated gloss, which may contain a definition, comments and examples. For example, the gloss of the synset {interest, interestingness} is (the power of attracting or holding one's interest (because it is unusual or exciting etc.); "they said nothing of great interest"; "primary colors can add interest to a room"). It has a definition, the power of attracting or holding one's interest, a comment, because it is unusual or exciting etc., and two examples: they said nothing of great interest and primary colors can add interest to a room. Some glosses can contain multiple definitions or multiple comments.
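The gloss structure just described is regular enough to split mechanically: quoted segments are examples, parenthesized text in the remainder is a comment, and what is left is the definition. The sketch below applies only these simple rules (the function name is ours, and real glosses with nested parentheses or multiple definitions would need more care):

```python
import re

def parse_gloss(gloss):
    """Split a WordNet-style gloss into definition, comments and examples."""
    # strip the outer parentheses, then split into ';'-separated segments
    parts = [p.strip() for p in gloss.strip("()").split(";")]
    examples = [p.strip('"') for p in parts if p.startswith('"')]
    body = "; ".join(p for p in parts if not p.startswith('"'))
    comments = re.findall(r"\(([^()]*)\)", body)             # parenthesized comments
    definition = re.sub(r"\s*\([^()]*\)", "", body).strip()  # what remains
    return definition, comments, examples

gloss = ('(the power of attracting or holding one\'s interest '
         '(because it is unusual or exciting etc.); '
         '"they said nothing of great interest"; '
         '"primary colors can add interest to a room")')
definition, comments, examples = parse_gloss(gloss)
print(definition)  # the power of attracting or holding one's interest
print(comments)
print(examples)
```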
2.1.2. SemCor

SemCor [32] is a textual corpus in which every word is linked to its appropriate sense in WordNet. Thus it can be viewed either as a corpus in which words have been tagged syntactically and semantically, or as a lexicon in which example sentences can be found for many definitions. The texts used to create the semantic concordances are extracted from the Brown Corpus and then linked to senses in the WordNet lexicon. The semantic tagging was done by hand, using various tools to annotate the text with WordNet senses.
The semantically tagged files are grouped into three semantic concordances, based on what was tagged and when. Each semantic concordance is stored in a separate directory, as shown in Table 2.2.
Table 2.2. The semantic concordances in SemCor 1.6

Name       Content                  What's tagged
"brown1"   103 Brown corpus files   all open classes
"brown2"   83 Brown corpus files    all open classes
"brownv"   166 Brown corpus files   verbs
The semantically tagged data is codified using SGML. The SGML mark-up language has pairs of the form attribute=value to specify: part of speech, word sense, paragraphs, sentences, etc. Figure 2.2 presents a fragment from a SemCor file; the mark-up identifies, for instance, the lemma of the word say, and its sense, based on the WordNet dictionary, as being sense #1.
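For illustration, a SemCor-style mark-up for such a token might look roughly as follows (the attribute names pos, lemma and wnsn follow the SemCor conventions; the values shown are only a sketch, not the actual fragment of Figure 2.2):

```sgml
<s snum=1>
<wf cmd=done pos=VB lemma=say wnsn=1>said</wf>
</s>
```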
2.2. Information Retrieval resources

AltaVista is used by our search system to fetch documents which potentially include relevant information with respect to the input question. We then tested our retrieval system using 50 questions derived from the topics provided at the 6th Text Retrieval Conference (TREC-6).
2.2.1. AltaVista

AltaVista [3] is a search engine developed in 1995 by the Digital Equipment Corporation in its Palo Alto research labs. There are several characteristics of this search service that make AltaVista one of the most powerful search engines. In choosing AltaVista for use in our system, we based our decision on two of its features: (1) the size of information on the Internet that can be accessed through AltaVista: it has a growing index of over 160,000,000 unique World Wide Web pages; (2) it accepts complex boolean searches through its advanced search function. These features make this search engine suitable for the development of software around it, with the goal of increasing the quality of the information retrieved.
Specific relationships can be created among the keywords of a query accepted by AltaVista. These relations can be created using brackets, phrases, and the operators AND, OR, NOT and NEAR. Mary AND lamb finds documents containing both the word Mary and the word lamb. Mary OR lamb finds documents containing either the word Mary or the word lamb. Mary NEAR lamb finds documents containing both the word Mary and the word lamb, but with the restriction that these words are separated by a maximum of 10 other words.1
Our main concern when we decided to rely on AltaVista for searching documents on the Internet regarded the reliability of this search engine. The number of hits obtained for a given query should vary only within a small range for searches performed at various intervals of time.

For the purpose of testing the reliability of AltaVista, we considered a set of 1,100 words (nouns, verbs, adjectives and adverbs); the set was built from one of the texts in the Brown corpus. A test run consisted of searching the Internet using AltaVista for each of these words, and recording the number of hits obtained. We performed 20 tests, over a period of 10 days, a test run every 12 hours. The overall results for these tests showed that, given AV as the average of the number of hits for a particular word:

90% of the time the hits were in the range [0.99 x AV, 1.01 x AV]
100% of the time the hits were in the range [0.85 x AV, 1.15 x AV]

Taking into consideration the size of the information found on the Internet and the fact that this information is highly unstructured, the small variations achieved by AltaVista in searching the Internet classify this search engine as a reliable one.

1 http
2.2.2. TREC topics

The Text Retrieval Conferences (TREC) are part of the TIPSTER Program, and are intended to encourage research in information retrieval from large texts. The information needs are described by data structures called topics.

The TIPSTER project distinguishes between two different types of queries: ad hoc and routing queries. Each topic contains several sections: the <title> section classifies the topic within a domain; the <desc> section gives a brief description of the topic (for TREC-6, this section was intended to be an initial search query); the <narr> section provides a further explanation of what a relevant material may look like.
"... criminal activity?".
After retrieving the information using the derived questions, the relevance of the information has been evaluated based on the narrative section of each topic.
CHAPTER 3

WORD SENSE DISAMBIGUATION

3.1. A word-word dependency approach

The method presented here takes advantage of the sentence context. The words are paired and an attempt is made to disambiguate one word within the context of the other word. This is done by searching the Internet with queries formed using different senses of one word, while keeping the other word fixed. The senses are ranked simply by the order provided by the number of hits. A good accuracy is obtained, perhaps because the number of texts on the Internet is so large. In this way, all the words are processed and the senses are ranked. We use the ranking of senses to curb the computational complexity in the step that follows. Only the most promising senses are kept.
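This first ranking step can be sketched as follows. The sense labels and hit counts below are invented for illustration; in the method itself they come from AltaVista queries that pair the fixed context word with each sense's similarity list:

```python
def rank_senses(hit_counts, keep=4):
    """Order the senses of a word by hit count, keeping the most promising."""
    ranked = sorted(hit_counts, key=hit_counts.get, reverse=True)
    return ranked[:keep]

# hypothetical hit counts for the senses of "letter" in the context of "write"
hits = {"letter#1 (alphabetic character)": 320,
        "letter#2 (written message)": 4150,
        "letter#3 (strict interpretation)": 45}
print(rank_senses(hits, keep=2))
```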
The next step is to refine the ordering of senses by using a completely different method, namely the semantic density. This is measured by the number of common words that are within a semantic distance of two or more words. The closer the semantic relationship between two words, the higher the semantic density between them. We introduce the semantic density because it is relatively easy to measure it on a MRD like WordNet. A metric is introduced in this sense which, when applied to all possible combinations of the senses of two or more words, ranks them.
An essential aspect of the WSD method presented here is that it provides a ranking of possible associations between words, instead of a binary yes/no decision for each possible sense combination. This allows for a controllable precision, as other modules may be able to distinguish later the correct sense association from such a small pool [27], [29].
1. Form a similarity list for each sense of one of the words. Pick one of the words, say W2, and using WordNet, form a similarity list for each sense of that word. For this, use the words from the synset of each sense and the words from the hypernym synsets. Consider, for example, that W2 has m senses, thus W2 appears in m similarity lists:

(W2^1, W2^1(1), W2^1(2), ..., W2^1(k1))
(W2^2, W2^2(1), W2^2(2), ..., W2^2(k2))
...
(W2^m, W2^m(1), W2^m(2), ..., W2^m(km))

where W2^1, W2^2, ..., W2^m are the senses of W2, and W2^i(s) is word number s in the similarity list of sense W2^i.

2. Form W1 - W2^i(s) pairs. The pairs that may be formed for the similarity lists above are:

(W1 W2^1, W1 W2^1(1), W1 W2^1(2), ..., W1 W2^1(k1))
(W1 W2^2, W1 W2^2(1), W1 W2^2(2), ..., W1 W2^2(k2))
...
(W1 W2^m, W1 W2^m(1), W1 W2^m(2), ..., W1 W2^m(km))
3. Search the Internet and rank the senses W2^i(s). A search performed on the Internet for each set of pairs as defined above results in a value indicating the frequency of occurrences for W1 and the sense of W2. In our experiments we used AltaVista [3], since it is one of the most powerful search engines currently available. Using the (W1, W2^i(s)) sets defined above, the following types of queries are formed:

(a) ("W1 * W2^i *" OR "W1 * W2^i(1) *" OR "W1 * W2^i(2) *" OR ... OR "W1 * W2^i(ki) *")

(b) ((W1* NEAR W2^i*) OR (W1* NEAR W2^i(1)*) OR (W1* NEAR W2^i(2)*) OR ... OR (W1* NEAR W2^i(ki)*))

for all 1 <= i <= m. The wild card * allows the retrieval of hits with morphologically related words. Using one of these queries, we get the number of hits for each sense i of the noun, and this provides a ranking of the m senses of W2 as they relate with W1.

A similar algorithm is used to rank the senses of W1 while keeping W2 constant (not disambiguated). Since these two procedures are done over a large corpus (the Internet), and with the help of similarity lists, there is little correlation between the results produced by the two procedures.
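The two query forms can be generated mechanically from a similarity list. A sketch (the function names and the sample list are ours; the operator syntax follows the AltaVista advanced-search operators described in section 2.2.1):

```python
def phrase_query(w1, sense_list):
    """Query form (a): exact phrases, with * matching morphological variants."""
    return "(" + " OR ".join('"%s * %s *"' % (w1, w) for w in sense_list) + ")"

def near_query(w1, sense_list):
    """Query form (b): NEAR allows up to 10 words between the two terms."""
    return "(" + " OR ".join("(%s* NEAR %s*)" % (w1, w) for w in sense_list) + ")"

# one sense of W2 together with its similarity-list words (illustrative)
sense = ["letter", "missive", "document"]
print(phrase_query("write", sense))
print(near_query("write", sense))
```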
3.2.2. Procedure Evaluation

This method was tested on 384 pairs: 200 verb-noun (files br-a01, br-a02), 127 adjective-noun (file br-a01), and 57 adverb-verb (file br-a01), extracted from the SemCor 1.6 tagging of the Brown corpus. Using query form (a) on AltaVista, we obtained the results shown in Table 3.1. The table indicates the percentages of correct senses (as given by SemCor) ranked by us in top 1, top 2, top 3, and top 4 of our list.
We concluded that by keeping the top four choices for verbs and nouns and the top two choices for adjectives and adverbs, we cover with high percentage (mid and upper 90's) all relevant senses. Looking from a different point of view, the meaning of the procedure so far is that it excludes the senses that do not apply, and this can save a considerable amount of computation time, as many words are highly polysemous.

We also used the query form (b), but the results obtained were similar; using the operator NEAR, a larger number of hits is reported, but the sense ranking remains more or less the same.
Table 3.1. Statistics gathered from the Internet for 384 word pairs

            top 1    top 2    top 3    top 4
noun        76%      83%      86%      98%
verb        60%      68%      86%      87%
adjective   79.8%    93%
adverb      87%      97%
3.3. Conceptual density algorithm

A measure of the relatedness between words can be a knowledge source for several decisions in NLP applications. The approach we take here is to construct a linguistic context for each sense of the verb and noun, and to measure the number of common nouns shared by the verb and the noun contexts. In WordNet each concept has a gloss that acts as a micro-context for that concept. This is a rich source of linguistic information that we found useful in determining conceptual density between words.
3.3.1. Algorithm 2

Input: semantically untagged verb - noun pair and a ranking of noun senses (as determined by Algorithm 1)
Output: sense tagged verb - noun pair
Procedure:

1. Determine the possible senses of the verb and the noun using WordNet.

2. Using Algorithm 1, the senses of the noun are ranked. Only the first t possible senses indicated by this ranking will be considered. The rest are dropped to reduce the computational complexity.

3. For each possible pair vi - nj, compute the conceptual density Cij:

    Cij = (sum for k = 1 to |cdij| of wk) / log(descendantsj)        (1)

where:
- |cdij| is the number of common concepts between the hierarchies of vi and nj;
- wk are the weights associated with these common concepts;
- descendantsj is the total number of words within the hierarchy of nj.

4. Cij ranks each pair vi - nj, for all i and j.

Some comments regarding this algorithm:

1. The conceptual density Cij is computed using the nouns found in the glosses of the sub-hierarchy of the verb vi. It is necessary to consider a larger hierarchy than just the one provided by synonyms and direct hyponyms. As we replaced the role of a corpus with glosses, better results are achieved if more glosses are considered. Still, we do not want to enlarge the context too much.
2. As the nouns with a big hierarchy tend to have a larger value for |cdij|, the weighted sum of common concepts is normalized with respect to the dimension of the noun hierarchy. Since the size of a hierarchy grows exponentially with its depth, we used the logarithm of the total number of descendants in the hierarchy, i.e. log(descendantsj).

3. We also took into consideration and experimented with a few other metrics. But after running the program on several examples, the formula from Algorithm 2 provided the best results.
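Formula (1) is straightforward to compute once the common concepts and their weights are known. A minimal sketch, with invented weights and hierarchy size (not actual WordNet values):

```python
import math

def conceptual_density(common_weights, descendants):
    """Cij = (sum of the weights wk of the common concepts) / log(descendantsj)."""
    if not common_weights:
        return 0.0  # no common concepts between the two hierarchies
    return sum(common_weights) / math.log(descendants)

# e.g. three shared concepts and a noun hierarchy of 975 descendants
print(conceptual_density([0.9, 0.8, 0.7], 975))
```

The log in the denominator implements the normalization discussed in comment 2 above.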
3.4. An example

As an example, let us consider the verb-noun collocation revise law. The verb revise has two possible senses in WordNet 1.6 and the noun law has seven senses. First, Algorithm 1 was applied, and the Internet was searched, using AltaVista, for all possible pairs V-N that may be created using revise and the words from the similarity lists of law. The following ranking of senses was obtained: law#2(2829), law#3(648), ...

Next, we determined: (1) |cdij|, the number of common concepts between the verb and noun hierarchies; (2) descendantsj, the total number of nouns within the hierarchy of each sense nj; and (3) Cij for each pair vi - nj, using the values above. The results are shown in Table 3.2.
19
vj
n2
revise
#1=2
j ij j
d
n2
v1
v2
5
0
n3
4
0
des
endantsj
n2
n3
975
975
1265
1265
Cij
n2
n3
0.30 0.28
0
0
20
When evaluating these results, one should take into consideration that:

1. Using the glosses as a base for calculating the conceptual density has the advantage of eliminating the use of a large corpus. But a disadvantage that comes from the use of glosses is that they are not part-of-speech tagged, like some corpora are (i.e. Treebank). For this reason, when determining the nouns from the verb glosses, an error rate is introduced, as some verbs (like make, have, go, do) are lexically ambiguous, having a noun representation in WordNet as well. We believe that future work on part-of-speech tagging the glosses of WordNet will improve our results.
2. The determination of senses in SemCor was done of course within a larger context, the context of sentence and discourse. By working only with a pair of words we do not take advantage of such a broader context. For example, when disambiguating the pair protect court, our method picked the court meaning "a room in which a law court sits", which seems reasonable given only two words, whereas SemCor gives the court meaning "an assembly to conduct judicial business", which results from the sentence context (this was our second choice). In the next section we extend our method to more than two words disambiguated at the same time.
3.5.2. Comparison with other methods

As indicated in [38], it is difficult to compare the WSD methods, as long as distinctions reside in the approach considered (MRD-based methods, supervised or unsupervised statistical methods), and in the words that are disambiguated. A method that disambiguates unrestricted nouns, verbs, adverbs and adjectives in texts is presented in [42]; it attempts to exploit sentential and discourse contexts and is based on the idea of semantic distance between words, and lexical relations. It uses WordNet and it was tested on SemCor. They report an average accuracy of 85.7% for nouns, 63.9% for verbs, 83.6% for adjectives and 86.5% for adverbs, slightly less than our results.
The results are shown in Tables 3.4 and 3.5 below, where the conceptual density calculated for sense #i of word X is presented in the column denoted by C#i:

Table 3.4. CD for bomb- pairs

X              C#1     C#2     C#3
bomb-cause     0.57    0       1.34
bomb-damage    5.09    0.13    2.64
bomb-injury    2.69    0.15    1.75
               8.35    0.28    5.73
By selecting the largest values for the conceptual density, the words are tagged with their senses as follows: for injury, our method selected the sense #2/4, defined as "... damage or hurt" (hypernym: accident), while the sense provided in SemCor (#1/4) is defined as "any physical damage" (hypernym: health problem).

This is a typical example of a mismatch caused by the fine granularity of senses in WordNet, which translates into a human judgment that is not clear cut. We think that the sense selection provided by our method is justified, as both damage and injury are objects of the same verb cause; the relatedness of damage(#1/5) and injury(#2/4) is larger, as both are of the same class noun.event, as opposed to injury(#1/4), which is of class noun.state.
Some other randomly selected examples were also considered; sentences 2a, 3a and 4a are extracted from SemCor, with the associated senses for each word, and sentences 2b, 3b and 4b show the verbs and the nouns tagged with their senses by our method. The only discrepancy is for the word broke, and perhaps this is due to the large number of its senses. The other word with a large number of senses, explode, was tagged correctly, which was encouraging.

Considering only pairs of two words, the words in the 4 sentences have been tagged as follows:
CHAPTER 4

A NATURAL LANGUAGE INTERFACE FOR SEARCHING THE INTERNET

4.1. Searching the Internet: NLP solutions for its drawbacks

Some of the problems of the current search engines have been already pointed out in Chapter 1. These include the fact that many of the documents retrieved for general queries are totally irrelevant to the subject of interest, and relevant documents may be missing because the query does not contain the exact keywords. On the other hand, a more refined query, with more restrictive boolean operators, may result in few or even no documents.
Another problem is that typically the users are non-computer professionals; they tend to use natural language questions, which are far more complex than the simple words or combinations of words accepted by the search engines. For example, for finding the presidents of the United States during the last century, a common user will ask "Who were the US Presidents of the last century?", rather than using a query formed with boolean operators, such as (US NEAR Presidents) AND (last NEAR century).
In this thesis we examine some of the benefits of using Natural Language Processing in conjunction with WordNet, to improve the quality of the results. Specifically, the thesis describes a system that addresses two issues: (1) the translation of a natural language question or sentence into a query and query extension using WordNet, and (2) extraction of paragraphs that render relevant information from the documents fetched by the search engines.
The performance of a search engine is measured based on both the relevance of the information retrieved and the number of relevant documents retrieved. The evaluation methodology can be based on two factors: the precision and the resolution. Usually, the precision is defined as the ratio between the number of relevant documents retrieved and the total number of documents retrieved. In the case of searching the Internet, this parameter is hard to compute, as it is very difficult to check all the documents returned by a search in order to determine the relevant ones. Generally, a user does not check more than 10 documents, as they are returned by a search engine; for this reason, we redefined this factor such that it measures the number of relevant documents found in the top 10 hits. In this thesis, from now on, the term precision will refer to this new metric, while the term full-precision will refer to the old meaning of precision, i.e. the number of relevant documents retrieved over the total number of documents retrieved. Another important fact which has to be determined is whether or not a user found an answer to her question. For this purpose, we defined a new parameter, called resolution, which represents the number of answered questions from a set of posed questions.
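Under these definitions, both measures are easy to compute from per-question counts. A sketch with invented counts; expressing both measures as ratios is one reasonable reading of the definitions above, and the function names are ours:

```python
def precision_top10(relevant_in_top10):
    """Average share of relevant documents among the top 10 hits per question."""
    counts = list(relevant_in_top10.values())
    return sum(counts) / (10 * len(counts))

def resolution(answered):
    """Share of posed questions for which an answer was found."""
    return sum(answered) / len(answered)

results = {"q1": 7, "q2": 0, "q3": 4}  # relevant documents in the top 10 hits
print(precision_top10(results))
print(resolution([n > 0 for n in results.values()]))
```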
The first step of our method, namely the query extension phase, is intended to increase the number of possibly relevant documents retrieved by a search, while the second step, of fetching relevant information, increases the precision. Both these steps are intended to increase the resolution [33].
4.2. Previous work
Two main approaches have been considered by researchers in trying to improve the quality of search on the Internet or on large collections of texts.
The first one is to make use of multiple search engines and create a meta search engine [41], [16]. This results in an increased number of documents, as they are retrieved based on the information stored in multiple search engine databases. The hard part of this approach is that different search engines are largely incompatible and do not always allow for interoperability. Solving this problem implies a unification of both the query language and the type of results returned by the different search engines.
The second approach is to use NLP techniques. Here, work has developed in two directions: (1) the use of query extension techniques to increase the number of documents retrieved, and (2) improving the quality of the information retrieved using NLP-based systems: REASON [4], INQUIRY [11].
Query expansion has been proved to have positive effects in retrieving relevant information [24]. The purpose of query extension can be either to broaden the set of documents retrieved or to increase the retrieval precision. In the former case, the query is expanded with terms similar to the words from the original query, while in the latter case the expansion procedure adds completely new terms.
There are two main techniques used in expanding an original query. The first one considers the use of a Machine Readable Dictionary (MRD); [44] and [2] make use of WordNet to enlarge the query so that it includes words which are semantically related to the concepts from the original query. The basic semantic relation used in their systems is synonymy; still, these techniques allow a further extension of the query, by using other semantic relations which can be derived from an MRD, such as the hypernymy and hyponymy relations.
A second technique considered by researchers for query expansion is to use words derived from relevant documents. The SMART system [8], developed at Cornell University, does massive query expansion, adding from 300 to 530 terms to each query, terms which are derived from relevant documents. They report a precision improvement of 7% to 25% obtained during their experiments. Another method is proposed by Ishikawa in [20]; the original query is extended with words from paragraphs which are considered to be relevant, based on a similarity measure between the paragraphs and the original query. In [24], Lu evaluates the performance obtained during experiments performed with the TIPSTER collection. They prove that larger queries can increase the full-precision within a range from 0% to 20%.
Application These questions are more difficult than the previous ones, in that the information obtained has to be applied in order to obtain a result.
Types How is ... an example of ...?, How is ... related to ...?.
Example Why are computers important?
Class 4
Analysis This class of questions involves a subdivision of objects/facts, to show how they are put together. The answer to these questions actually implies a separation of the whole into components.
Types Classify ... according to ..., What evidence can you list for ...?.
Example What are the features of a mammal?
Class 5
Synthesis This is the opposite of the previous type. Here, the ideas and facts have to be compounded to create a new whole.
Types What would you predict from ...?, What solutions would you suggest for ...?
Example How would you design a new computer?
Class 6
Evaluation This is the hardest type of question, as it involves making decisions about issues and developing opinions or judgments.
[Figure: System architecture. An English sentence passes through lexical processing (POS tagging, WSD) against WordNet and its glosses; the resulting keywords xi, combined through boolean and lexical operators, form a query submitted to search engines (AltaVista); the ranked documents returned are then filtered into relevant paragraphs/sentences.]
Level 1. Words semantically similar, i.e. words belonging to the same synset.
Level 2. Words that express concepts linked by semantic relations, i.e. hyponymy/hypernymy, meronymy/holonymy and entailment.
Level 3. Words that belong to sibling concepts, namely concepts that are subsumed by the same concept.
Consider, for example, sense number 1 of the noun operation. There are eight senses of this word in WordNet. The synset for the first sense includes only the word itself. The hypernym synset for this first sense includes three words: action, activity and activeness. The sibling synsets are {agency}, {play}, {busyness}, {behavior, behaviour}, {eruption, eructation} and {swing}.
The similarity lists that we can now create, corresponding to the three levels defined above, are:
L1. {operation}
L2. {operation, action, activity, activeness}
L3. {operation, action, activity, activeness, agency, play, busyness, behavior, behaviour, eruption, eructation, swing}
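The construction of these three lists can be illustrated with a minimal sketch; the hand-built synsets below simply transcribe the WordNet fragment from the example, rather than querying WordNet itself.

```python
# Hand-built fragment of WordNet around sense 1 of the noun "operation",
# transcribed from the example in the text (not a live WordNet lookup).
synset = {"operation"}                            # the word's own synset
hypernym = {"action", "activity", "activeness"}   # direct hypernym synset
siblings = [{"agency"}, {"play"}, {"busyness"},   # other hyponyms of the
            {"behavior", "behaviour"},            # same hypernym concept
            {"eruption", "eructation"}, {"swing"}]

L1 = set(synset)                    # Level 1: same synset
L2 = L1 | hypernym                  # Level 2: add semantically linked concepts
L3 = L2 | set().union(*siblings)    # Level 3: add the sibling synsets

print(sorted(L1))
print(sorted(L2))
print(len(L3))   # 12 words in the Level 3 list
```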
For each keyword xi, similarity lists Wi are built; the elements of a list Wi are words within a given level of similarity with the word xi. These lists can now be used for the actual query formulation, using the boolean operators accepted by the current search engines. The words within a similarity list Wi are connected by the OR operator, while the lists themselves are connected by the AND and NEAR operators. While various combinations of AND or NEAR operators are possible, there are two basic forms giving the maximum, respectively the minimum, of the number of documents retrieved:
(1) W1 AND W2 AND ... AND Wn
(2) W1 NEAR W2 NEAR ... NEAR Wn
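Generating these two extreme forms from the similarity lists can be sketched as follows; the helper below is a hypothetical illustration (words within a list are joined by OR, and the resulting groups by the chosen operator), not the system's actual query builder.

```python
def build_query(similarity_lists, op):
    """Join each similarity list into an OR-group, then connect the
    groups with the given boolean operator (AND or NEAR)."""
    groups = ["(" + " OR ".join(words) + ")" for words in similarity_lists]
    return (" %s " % op).join(groups)

lists = [["US", "USA"], ["Presidents"], ["century"]]
print(build_query(lists, "AND"))
# (US OR USA) AND (Presidents) AND (century)
print(build_query(lists, "NEAR"))
# (US OR USA) NEAR (Presidents) NEAR (century)
```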
In most of the cases, format (1) gathered thousands of documents, while format (2) had almost always null results.
The conclusion so far is that the documents containing the answers, if any, must be among the large number of documents provided by the AND operators. However, the search engines failed to rank them at the top of the list. Thus, we sought to find new operators that filtered out many of the irrelevant texts.
4.4.3. New operators
Our approach to filtering documents is to first search the Internet using weak operators (AND, OR) and then to further search this large number of documents using more powerful operators. For this second phase, we propose the following additional operators:
PARAGRAPH n (... similarity lists ...)
The PARAGRAPH operator searches like an AND operator for the words in the similarity lists, with the constraint that the words belong only to some n consecutive paragraphs. The rationale is that most likely the information requested is found in a few paragraphs rather than being dispersed over an entire document. A similar idea can be found in [10].
SENTENCE n (... similarity lists ...)
The SENTENCE operator searches like an AND operator for the words in the similarity lists, with the constraint that the words belong to a single sentence. The answers to many queries are found in single, sometimes complex sentences.
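The semantics of the two operators can be sketched as follows; the `matches` helper, and the simple substring test it uses, are illustrative assumptions rather than the thesis implementation.

```python
def matches(unit, sim_list):
    """True if any word of the similarity list occurs in the text unit
    (a crude substring test, for illustration only)."""
    text = unit.lower()
    return any(w.lower() in text for w in sim_list)

def paragraph_op(paragraphs, sim_lists, n):
    """PARAGRAPH n: all similarity lists must co-occur within some
    window of n consecutive paragraphs."""
    for i in range(max(1, len(paragraphs) - n + 1)):
        window = " ".join(paragraphs[i:i + n])
        if all(matches(window, sl) for sl in sim_lists):
            return True
    return False

def sentence_op(sentences, sim_lists):
    """SENTENCE: all similarity lists must match inside one sentence."""
    return any(all(matches(s, sl) for sl in sim_lists) for s in sentences)

paras = ["The average salary in the US ...", "Federal tax rates are ..."]
print(paragraph_op(paras, [["tax"], ["salary"]], 2))   # True
print(paragraph_op(paras, [["tax"], ["salary"]], 1))   # False
```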
In order to apply these new operators, the documents gathered from the Internet have to be segmented into sentences and paragraphs, respectively. Separating a text into sentences proves to be an easy task; one could just make use of the punctuation to solve this problem. Paragraph segmentation, instead, is much more difficult, due first of all to the highly unstructured texts that can be found on the Web. Work developed in this direction is presented in [18] and [10], but these methods work only for structured texts, with a priori known lexical separators (i.e. a tag, an empty line, etc.). Thus, we had to use a method that covers almost all the possible paragraph separators that can occur in texts on the web. The paragraph separators that we considered so far are: (1) HTML tags; (2) empty lines; (3) paragraph indentation.
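A minimal segmentation sketch covering the three separator types might look like the following; the exact tag set and the indentation rule are assumptions, since the thesis only names the separator categories.

```python
import re

def split_paragraphs(text):
    # (1) HTML tags such as <p> or <br> act as paragraph separators
    text = re.sub(r"(?i)<\s*(p|br)\s*/?\s*>", "\n\n", text)
    # (2) an empty line, or (3) a newline followed by indentation,
    # starts a new paragraph
    parts = re.split(r"\n\s*\n|\n(?=[ \t]+\S)", text)
    return [p.strip() for p in parts if p.strip()]

sample = "First paragraph.<p>Second paragraph.\n\nThird.\n    Fourth, indented."
for p in split_paragraphs(sample):
    print(p)
```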
4.5. An example
The system operation is presented below with the help of an example. Suppose one wants to find the answer to the question: "How much tax does an average salary ..."
[Keywords x1 ... x6 extracted from the question, each annotated with its part of speech ("pos") and WordNet sense number; the listing was lost in extraction.]
In the notation above, "pos" means part of speech, and the sense number indicates the actual WordNet sense that resulted from the disambiguation out of all possible senses in WordNet. For instance, the adjective average has 5 senses and the system picked sense #4. Note that the senses of words in WordNet are ranked in the order of their utilization frequency in large corpora.
These keywords will be the input for the next step of our system, except those keywords having a too high position in the WordNet hierarchies. In our example, considering only Level 1 similarity, and only the first four keywords, WordNet provides:
W1 = {United States, United States of America, America, US, U.S., USA, U.S.A.}
W2 = ...
W3 = ...
W4 = ...
Table 4.1. Queries with various combinations of operators

      Query                                Number of documents
  1   W1 AND W2 AND W3 AND W4                      15,464
  2   W1 AND (W2 NEAR W3) AND W4                    3,217
  3   W1 NEAR (W2 NEAR W3) AND W4                     803
  4   W1 NEAR W2 NEAR W3 NEAR W4                        0
  5   W1 AND {average W3} AND W4                     1,752
  6   W1 AND {average W3} NEAR W4                    1 (no)
  7   W1 NEAR {average W3} NEAR W4                       0
These lists are used to formulate queries for the search engine. As we will see, the operators available today for the search engines are not adequate to provide the desired answers in most of the cases. Table 4.1 shows some queries and the number of documents provided by AltaVista, considered to be one of the search engines with the most powerful set of operators available today.
The ranking provided by AltaVista is of no use for us here. None of the leading documents in any category provides the desired information. The only document fetched by Query 6 is equally irrelevant:
....Instead, their plans would shift more of the total tax burden on to labor, taxing capital once under a business tax, and taxing wages and salaries twice under both the income tax and the payroll tax. Middle-class Americans have to pay more under such a system, and wealthy people much less....
....The average taxpayer must work 86 days to pay all federal taxes, and must work 36 days just to pay his or her federal income tax. The average American must work 2 hours and 49 minutes every working day to pay all their taxes.....
An analysis of the results in the table above indicates that there is a gap in the volume of documents retrieved with the AltaVista operators. For instance, using only the AND operator (Query 1), 15,464 documents were obtained, but the NEAR operator (Query 4) produced no output. This operator seems to be too restrictive, and it fails to identify the right answer. Various combinations of AND and NEAR operators were tried, as indicated in the table above, with no great results.
Using the PARAGRAPH operator for the example above, the system found a relevant answer:
For the purpose of testing our system, we considered 50 abstract questions derived from the descriptive section of each of the 50 topics in the TREC-6 set, and 50 concrete questions used by users to search the Internet; let us denote these sets as the TREC set, respectively the REAL set. Table 4.2 presents five randomly selected questions from the TREC set and five questions from the REAL set, together with the results we obtained.
Table 4.2. A sample of the results obtained for randomly selected questions from the TREC and the REAL sets.
Question          AND xi       NEAR xi     AND wi       NEAR wi     Paragraph   Sentence
TREC questions
  Q1            27,716 / 0      3 / 1     48,133 / 0      5 / 1       6 / 1      0 / 0
  Q2             9,432 / 1     13 / 3     10,271 / 2     15 / 3      40 / 11     3 / 2
  Q3               178 / 1      4 / 0        504 / 1      4 / 0       2 / 1      0 / 0
  Q4            32,133 / 0  6,214 / 1     32,133 / 0  6,214 / 1    150 / 80     5 / 2
  Q5               246 / 0      5 / 1        260 / 1      5 / 1     15 / 6      1 / 0
REAL questions
  Q6             1,360 / 2      3 / 3      2,608 / 2     35 / 5     61 / 34     6 / 3
  Q7                 2 / 0      0 / 0         30 / 1      0 / 0     10 / 1      0 / 0
  Q8             4,503 / 0    202 / 0     10,221 / 0    575 / 1    117 / 10     1 / 0
  Q9            36,049 / 0    858 / 1     36,049 / 0    858 / 1     15 / 8      0 / 0
  Q10                6 / 1      0 / 0         70 / 0      0 / 0      6 / 6      1 / 0

(In each cell: total documents, respectively paragraphs, retrieved / relevant in the top 10, respectively relevant paragraphs. The question texts were lost in extraction; rows are labeled Q1-Q10.)
Each cell in this table includes two numbers: the first represents the total number of documents retrieved for the question (respectively, the total number of paragraphs retrieved when the PARAGRAPH operator was used); the second represents the number of relevant documents found in the top 10 ranking (respectively, the number of relevant paragraphs).
The AND xi and NEAR xi columns contain the results obtained for the original keywords xi. When we replaced the words xi with their similarity lists derived from WordNet, the number of documents retrieved increased, as expected. The results obtained in these cases, with an AND, respectively a NEAR operator applied to the similarity lists, are presented in the columns AND wi and NEAR wi.
The next column contains the number of documents extracted when the new operator PARAGRAPH 2 (meaning two consecutive paragraphs) was applied to words from the similarity lists. The results were encouraging. The number of documents retrieved was small, and correct answers were found in almost all cases.
The last column shows that the operator SENTENCE was too restrictive, producing correct answers in only a few cases.
A summary of the results, for the 100 questions that we used to test our system, is presented in Table 4.3. In each box of this table, the top number indicates the number of documents retrieved, and the bottom number indicates the relevant documents in the top 10 ranking, respectively the total number of relevant paragraphs.
Table 4.3. Summary of results for the 100 test questions. In each box, the top number indicates the number of documents retrieved, and the bottom number indicates the relevant documents in top 10 ranking, respectively the total number of relevant paragraphs.

Question                              AND xi   NEAR xi   AND wi   NEAR wi   Paragraph
Average question from the TREC set     7,746      258    25,803      332      26.04
                                        0.16     0.48      0.44     0.88      11.10
Average question from the REAL set    13,510    1,843    28,715    3,003      48.95
                                        0.63     1.24      0.61     1.36      13.58
As shown by the results presented in the table above, the query extension increased the number of documents retrieved, and also the number of relevant documents in the top 10 ranking. With the PARAGRAPH operator, the number of documents (paragraphs) retrieved decreased significantly. Table 4.4 presents the precision and the resolution of the search performed with AltaVista for the two sets of questions. Again, the precision here is defined as the number of relevant documents retrieved in the top 10 hits, while the resolution measures the number of questions for which an answer could be found.
Table 4.4. Precision and resolution for searches performed with AltaVista for the 100 test questions.

Question                                AND xi   NEAR xi   AND wi   NEAR wi
Precision
  Average question from the TREC set     1.6%      4.8%     4.4%      8.8%
  Average question from the REAL set     6.3%     12.43%    6.09%    13.65%
Resolution
  Average question from the TREC set      36%       30%      44%       42%
  Average question from the REAL set      20%       28%      36%       48%
For the 100 tests that we performed using this new operator, we obtained a full-precision (as it is defined in section 4.1) of 42.6% in the case of TREC questions, and 27.7% in the case of REAL questions. The resolution, i.e. the number of answered questions, was 90% in the case of TREC questions, and 66% in the case of REAL questions; comparing this with the resolution of the searches performed with AltaVista, as shown in Table 4.4, it results that more questions have been answered using our system than using AltaVista. We explain in the next subsection the fact that the resolution of the second set was smaller than that of the first set: there are several cases when no answer could be found, due either to very general or very specific terms used when formulating the questions.
... is the land like in Costa Rica?". The queries formed with boolean operators led to 26,304 documents when the AND operator was used, respectively 1,956 documents when we used the NEAR operator, with no relevant information in the top 10 ranked documents. The query expansion phase brought no modifications to the structure of the query, as both "land" and "Costa Rica" have no synonyms. Using the PARAGRAPH operator, 713 paragraphs were retrieved, too many to be useful in the process of finding a relevant answer.
At the other extreme, questions with very specialized terms lead to no results. For the question "Where can I find a cartoon depicting the Sugar Act of 1764?", we obtained 0 documents using an AND-query, and 0 documents with a NEAR-query. The query expansion phase increased the former number to 15 (still, with no relevant information in these documents), and the PARAGRAPH operator could not find any relevant information.
4.6.3. Extensions
A more flexible NEAR search could be implemented with a new operator SEQUENCE (W1 d W2 d ... Wn), where d is a numeric variable that indicates the distance between the words in the similarity lists. This operator requires that the sequence of the words in the similarity lists be maintained as specified.
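The intended semantics of SEQUENCE can be sketched as follows; the greedy left-to-right matching and the exact form of the distance check are illustrative assumptions, since the operator is only proposed, not implemented, in the thesis.

```python
def sequence_op(tokens, sim_lists, d):
    """SEQUENCE: the similarity lists must match in the given order,
    with at most d intervening words between consecutive matches.
    Greedy sketch: each list takes its first admissible position."""
    tokens = [t.lower() for t in tokens]
    pos = -1
    for sl in sim_lists:
        words = {w.lower() for w in sl}
        hits = [i for i, t in enumerate(tokens)
                if t in words and i > pos
                and (pos < 0 or i - pos - 1 <= d)]
        if not hits:
            return False
        pos = hits[0]
    return True

text = "the average tax paid on a salary in the US".split()
print(sequence_op(text, [["tax"], ["salary"]], 3))   # True: 3 words apart
print(sequence_op(text, [["salary"], ["tax"]], 3))   # False: order matters
```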
There are several other possible ways of improving web search not discussed in this thesis. One such possibility is to index words by their WordNet senses. This of course implies some on-line word-sense disambiguation of documents, which may become possible in the not too distant future. Semantic indexing has the potential of improving the ranking of search results, as well as allowing information extraction of objects and their relationships [36]. Another way to improve web search is to use compound nouns or collocations. In WordNet there are thousands of groups of words, such as mother in law, stock market, etc., that point to their respective concept. Each compound noun is better indexed as one term. This reduces the storage space for the search engine and may increase the precision.
CHAPTER 5
CONCLUSIONS AND FURTHER WORK
WordNet is a fine grain MRD, and this makes it more difficult to pinpoint the correct sense combination, since there are many to choose from and many are semantically close. For applications such as machine translation, fine grain disambiguation works well, but for information extraction and some other applications this is overkill, and some senses may be lumped together. The ranking of senses is useful for many applications.
In this thesis, we presented a method which disambiguates all the nouns, verbs, adjectives and adverbs in a text. It is based on the idea of word-word dependencies found within a context. Our WSD method combines both statistical results, as they are gathered from the Internet, and semantic density measures, calculated on WordNet hierarchies.
An important aspect of this method is that it provides a sense ranking over the possible senses that words might have, for a given context. This information proves to be useful in tasks such as information retrieval, where knowing the first possible correct senses of the words in the input question/sentence can greatly improve the quality of the information retrieved.
The main drawback of this method regards the time needed to perform the disambiguation of words. Both algorithms are time consuming:
1. The first algorithm gathers statistics from the Internet. Even though it provides us with good results in ranking word senses, this algorithm has the inherent problems related to the use of the Internet, such as network congestion, sites found at long distance, or sites which are unreachable.
2. The second algorithm calculates the semantic distance between words, using WordNet. As described in section 3.3, this makes use of the semantic hierarchies of the different senses of the words; the time needed to compute this metric varies directly with the size of these hierarchies, and for large hierarchies the time will be of the order of minutes.
Table 5.1 presents the average time needed to execute these algorithms, as determined for 20 verb-noun pairs.
Table 5.1. Time measurements for the disambiguation method

Algorithm                                     Time (sec.)
Algorithm 1 (statistics from the Internet)        180
Algorithm 2 (semantic density)                     50
Total                                             230
The average time for disambiguating a pair of words is 230 seconds. This does not constitute a big problem when the disambiguation has to be done for small texts, like questions or sentences. But it becomes overwhelming when large texts have to be semantically tagged. In this latter case, a different approach should be considered.
To have a robust WSD system which could work for large texts, as well as for small sentences, further work is needed.
1. Improve the POS tagger with concept identification. As it is now, Brill's POS tagger recognizes only single words; for example, "child care", which is identified by WordNet as being a single noun concept, is tagged by Brill's tagger as two separate nouns, "child" and "care". A possible solution to this problem is to combine the lexicon provided by WordNet with the POS tagger developed by Eric Brill, so as to obtain an improved tagger which will recognize the lexical expressions.
2. Identify the syntactic relations in the input text. The basic relations which have to be identified are (verb, object), (verb, subject), (adjective, noun), (adverb, verb). For this purpose, we plan to use either the syntactic parser developed at the University of Pennsylvania, or the one from Xerox, depending on the accuracy which we obtain during tests.
3. An error rate is introduced into our semantic density method because the glosses are not POS tagged, nor syntactically parsed. We believe that a significant improvement can be achieved if we build the context of a particular word by taking into consideration the syntactic relations. At this point, we have to make a choice.
(a) We can try to replace the examples provided by the glosses with examples found in SemCor. The files from SemCor are already POS tagged, and we will also have to identify the syntactic relations in these texts. The original algorithm will be modified so as to compute the syntactic density among words by taking into consideration only pairs of words in the same syntactic relation.
(b) We can continue to use the glosses, but we should modify them so that POS and syntactic information is included.
Based on the tests which we are going to perform, regarding the accuracy achieved and the time performance of each of these methods, we will decide which approach can lead to a more robust WSD system.
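The compound-recognition idea from item 1 above can be sketched independently of any particular tagger; the small lexicon below is a stand-in for the multiword entries of WordNet, and the longest-match strategy is an assumption.

```python
# Stand-in lexicon of multiword entries (a real system would load
# the compound nouns listed in WordNet).
LEXICON = {("child", "care"), ("stock", "market"), ("mother", "in", "law")}
MAX_LEN = max(len(c) for c in LEXICON)

def merge_compounds(tokens):
    """Re-tokenize so that known multiword expressions become one token,
    preferring the longest compound starting at each position."""
    out, i = [], 0
    while i < len(tokens):
        for n in range(min(MAX_LEN, len(tokens) - i), 1, -1):
            if tuple(t.lower() for t in tokens[i:i + n]) in LEXICON:
                out.append("_".join(tokens[i:i + n]))
                i += n
                break
        else:
            out.append(tokens[i])
            i += 1
    return out

print(merge_compounds("good child care near the stock market".split()))
# ['good', 'child_care', 'near', 'the', 'stock_market']
```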
As an application of the WSD method, we presented a system for searching the Internet, improved with NLP techniques. This system, described in Chapter 4, performs two main tasks: (1) it expands the input question; the search is extended with words semantically related to the keywords from the original question; (2) new lexical operators are used to fetch the relevant information from the documents found on the Internet. The results obtained are promising: 42.6% in the case of TREC questions, and 27.7% in the case of REAL questions, represent precisions higher than (or comparable with) previous results in this field.
From the types of questions presented in section 4.3, our system is limited to the first class of questions, which involves basically an identification of information. This is not a hard limitation though, as studies performed in this field proved that 80% of the questions usually asked by people are included in this class. Further work will address the second class of questions; for this, the system will have to identify several facts related to the subject of the question, and eventually summarize these facts in order to give an answer.
The broad use of natural language queries in information retrieval is still beyond the capabilities of current natural language technology. Machine readable dictionaries, such as WordNet, prove to be useful tools for web search. However, their use for searching the Internet has been limited so far [2], [19], [21].
REFERENCES
[1]
[2] Allen, B.
[3]
[4] Anikina, ...gatsky, B. Reason: NLP-based search system for WWW. In Proceedings of the American Association for Artificial Intelligence Conference, Spring Symposium, "NLP for WWW" (Stanford University, CA, 1997), pp. 1-10.
[5] Bloom, B., Engelhart, M., Furst, E., Hill, W., and Krathwohl, D.
[6] Brill, E. ... models. In Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics (ACL-94) (Las Cruces, NM, June 1994), pp. 139-146.
[7]
[8] ... Automatic Query ...
[9] ... Knowledge-based information retrieval from semi-structured text. In Proceedings of the American Association for Artificial Intelligence Conference, Fall Symposium, "AI Applications in Knowledge Navigation & Retrieval" (Cambridge, MA, 1995).
[10] Callan, J. ... In Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (Dublin, Ireland, 1994), pp. 302-310.
[11] Callan, J., Croft, W., and Harding, S. The INQUERY retrieval system. In Proceedings of the 3rd International Conference on Database and Expert Systems Applications (1992), pp. 78-83.
[12]
[13]
[14] FindLaw, internet legal resources. http://www.findlaw.com/index.html, 1997.
[15] ... In Proceedings of the DARPA Speech and Natural Language Workshop (Harriman, New York, 1992).
[16] Gravano, L., Chang, K., Garcia-Molina, H., Lagoze, C., and ...
[17]
[18] Hearst, M. ... In Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics (Las Cruces, NM, 1994), pp. 9-16.
[19] ... the navigation of retrieval results. In Proceedings of the American Association for Artificial Intelligence ...
[20] Ishikawa, ...
[21] Katz, B.
[22] Leong, M.
[23]
[24]
[25] Marcus, M., Santorini, B., and Marcinkiewicz, M. Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics 19, 2 (1993), 313-330.
[26] McRoy, S.
[27]
[28]
[29] Mihalcea, R., and Moldovan, D. A method for Word Sense Disambiguation of unrestricted text. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics (ACL-99) (Maryland, June 1999). (to appear).
[30] Miller, G.
[31] Miller, G., Chodorow, M., Landes, S., Leacock, C., and Thomas, R. Using a semantic concordance for sense identification. In Proceedings of the 4th ARPA Human Language Technology Workshop (1994), pp. 240-243.
[32] Miller, G., Leacock, C., Randee, T., and Bunker, R. A semantic ...
[33] Moldovan, D., et al.
[34]
[35]
[36]
[37]
[38] ... A perspective on Word Sense Disambiguation methods and their evaluation. In Proceedings of the ACL SIGLEX Workshop on Tagging Text with Lexical Semantics: Why, What and How? (Washington, DC, April 1997).
[39] ... Combining unsupervised lexical knowledge methods for Word Sense Disambiguation. Computational Linguistics (1997).
[40] ... Computer evaluation of indexing and text processing. Prentice Hall, Inc., Englewood Cliffs, New Jersey, 1971, pp. 143-180.
[41] Selberg, E., and Etzioni, O. Multi-service search and comparison using the MetaCrawler. In Proceedings of the 4th International World Wide Web Conference (Boston, MA, 1995), pp. 195-208.
[42]
[43] ... General Word Sense Disambiguation method based on a full sentential context. In Usage of WordNet in Natural Language Processing, Proceedings of the COLING-ACL Workshop (Montreal, Canada, July 1998).
[44] Voorhees, E.
[45] Yarowsky, D. ... methods. In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics (ACL-95) (Cambridge, MA, 1995), pp. 189-196.