
WORD SENSE DISAMBIGUATION AND ITS

APPLICATION TO INTERNET SEARCH

Approved by:

Dr. Dan Moldovan

Dr. Sanda Harabagiu

Dr. Margaret Dunham

Dr. Richard Helgason

Dr. Peter Raad

WORD SENSE DISAMBIGUATION AND ITS


APPLICATION TO INTERNET SEARCH
A Master's Thesis Presented to the Graduate Faculty of the
School of Engineering and Applied Science
Southern Methodist University
in
Partial Fulfillment of the Requirements
for the degree of
Master in Computer Science
with a
Major in Computer Science
by

Rada Mihalcea
(B.S., Technical University of Cluj-Napoca, 1997)
May 15, 1999

ACKNOWLEDGMENTS

I want first of all to thank my advisor, Dr. Dan I. Moldovan, for his continuous support and help during my graduate study, and for his valuable advice in writing this thesis. I also want to thank the faculty, staff and students of the Computer Science Department for their help and support.


Mihalcea, Rada

B.S., Technical University of Cluj-Napoca, 1997

Word Sense Disambiguation and its
Application to Internet Search

Advisor: Professor Dan Moldovan
Master in Computer Science degree conferred May 15, 1999
Master's Thesis completed May 15, 1999

Selecting the most appropriate sense for an ambiguous word in a sentence is a central problem in Natural Language Processing. The solution impacts other tasks such as information retrieval, discourse, coherence, inference and others. This thesis presents a method that attempts to disambiguate all the nouns, verbs, adverbs and adjectives in a text, using the senses provided in WordNet. The senses are ranked using two sources of information: (1) the Internet, for gathering statistics for word-word co-occurrences, and (2) the semantic density for a pair of words, measured on WordNet hierarchies.

An essential aspect of the Word Sense Disambiguation method presented here is that it provides a ranking of possible associations between word senses, rather than a binary yes/no decision for a possible sense combination. This proves to be particularly useful for Natural Language Processing tasks such as retrieving information related to a particular input question.

An important task which can benefit greatly from a Word Sense Disambiguation method is Internet search. This thesis presents a possible application of Word Sense Disambiguation techniques for improving the quality of search on the Internet. Knowing the sense of the words in the input question enables the creation of similarity lists which contain words semantically related to the original keywords, and which can be further used for query extension. The extended query, together with the new lexical operators defined for information extraction, improves both the precision and the resolution of a search on the Internet.



Specifically, we describe a system that addresses two issues: (1) the translation of a natural language question or sentence into a query, and query extension using WordNet, and (2) extraction of paragraphs that render relevant information from the documents fetched by the search engines.

TABLE OF CONTENTS

ACKNOWLEDGMENTS

LIST OF FIGURES

LIST OF TABLES

CHAPTER

1. INTRODUCTION

2. BACKGROUND ON RESOURCES
   2.1. Lexical resources
        2.1.1. WordNet
               2.1.1.1. The glosses in WordNet
        2.1.2. SemCor
        2.1.3. Brill's part of speech tagger
   2.2. Information Retrieval resources
        2.2.1. AltaVista
        2.2.2. TREC topics

3. WORD SENSE DISAMBIGUATION
   3.1. A word-word dependency approach
   3.2. Contextual ranking of word senses
        3.2.1. Algorithm 1
        3.2.2. Procedure Evaluation
   3.3. Conceptual density algorithm
        3.3.1. Algorithm 2
        3.3.2. Rationale
   3.4. An example
   3.5. Evaluation and comparison with other methods
        3.5.1. Tests against SemCor
        3.5.2. Comparison with other methods
   3.6. Extensions
        3.6.1. Noun-noun and verb-verb pairs
        3.6.2. Larger window size
   3.7. Conclusion

4. A NATURAL LANGUAGE INTERFACE FOR SEARCHING THE INTERNET
   4.1. Searching the Internet: NLP solutions for its drawbacks
   4.2. Previous work
   4.3. Types of questions
   4.4. System architecture
        4.4.1. Lexical processing
        4.4.2. Query formulation
        4.4.3. New operators
   4.5. An example
   4.6. Results
        4.6.1. Discussion
        4.6.2. Limitations
        4.6.3. Extensions

5. CONCLUSIONS AND FURTHER WORK

REFERENCES

LIST OF FIGURES

1.1. A comparison between two Internet search system architectures: (left) the basic Internet search system currently used by search engines, and (right) an Internet search system improved with NLP techniques. The dark rectangles represent stages in the search system performed with NLP methods.
2.1. A WordNet hierarchy
2.2. A fragment from SemCor
2.3. A TIPSTER topic
4.1. System organization

LIST OF TABLES

2.1. The number of words and concepts in WordNet 1.6
2.2. The semantic concordances in SemCor 1.6
3.1. Statistics gathered from the Internet for 384 word pairs
3.2. Values used in computing the conceptual density, and the conceptual density Cij
3.3. Final results obtained for 384 word pairs using both algorithms
3.4. CD for bomb- pairs
3.5. CD for cause- pairs
3.6. CD for damage- pairs
3.7. CD for injury- pairs
4.1. Queries with various combinations of operators
4.2. A sample of the results obtained for randomly selected questions from the TREC and the REAL sets
4.3. Summary of results for the 100 test questions. In each box, the top number indicates the number of documents retrieved, and the bottom numbers indicate the relevant documents in the top 10 ranking and, respectively, the total number of relevant paragraphs.
4.4. Precision and resolution for searches performed with AltaVista for the 100 test questions
5.1. Time measurements for the disambiguation method

To my family.

CHAPTER 1
INTRODUCTION

Word Sense Disambiguation (WSD) is an open problem in Natural Language Processing (NLP). There are several other NLP fields, like information retrieval, coherence and discourse, which can benefit from an accurate WSD method.

WSD methods can be broadly classified into three types:

1. WSD that makes use of the information provided by machine readable dictionaries: this is the case with the work reported by Cowie [12], Miller [31], Agirre [1], Li [23] and McRoy [26];

2. WSD that uses information gathered from training on a corpus that has already been semantically disambiguated (supervised training methods); such methods have been implemented by Gale [15], Ng [35];

3. WSD that uses information gathered from raw corpora (unsupervised training methods); Yarowsky [45] and Resnik [37] presented unsupervised WSD methods.

There are also hybrid methods that combine several sources of knowledge such as lexicon information, heuristics, collocations and others: McRoy [26], Bruce [7], Ng [35] and Rigau [39].

Statistical methods produce high accuracy results for a small number of preselected words. A lack of widely available semantically tagged corpora almost excludes supervised learning methods. A possible solution for the automatic acquisition of sense tagged corpora has been presented in [28], but the corpora acquired with this method have not yet been tested for statistical disambiguation of words. On the other hand, disambiguation using unsupervised methods has the disadvantage that the senses are not well defined. So far, none of the statistical methods disambiguates adjectives or adverbs.

The WSD method proposed in this thesis attempts to disambiguate all the nouns, verbs, adjectives and adverbs in a text, using the senses provided in WordNet [13]. To our knowledge, there is only one other method, recently reported, that disambiguates unrestricted words in texts [42].
Our method is based on the idea of word-word dependencies. The senses of the words are determined by the context in which they appear. Given the sentences "can you spell the first letter" and "he wrote a letter to the editor", the word "letter" has different meanings: in the first example, its sense is that of "alphabetical letter", while in the second example it means "a written message". The correct sense can be selected if we consider the other words within the given context, and then determine the most likely combinations of word senses.
Our approach for WSD is to disambiguate words based on their occurrence within word-word pairs. It is actually a hybrid method which combines two sources of information. In the first step, the Internet is used to gather statistics for word co-occurrences. The ranking of senses provided by this first step is used as input to the second algorithm: the semantic density of a word pair is calculated using the information provided by WordNet hierarchies.

With this method, we actually rank the possible correct senses of the words. This can be particularly useful for tasks such as information retrieval. Consider, for example, the sentence He has to investigate all the reports. Even though the phrase may be ambiguous due to the small context, the results of a search based on such a sentence can be improved if the possible associations between the senses of the verb investigate and the noun report are determined.


As a possible application of the WSD method proposed here, we present a system which improves the quality of search on the Internet. The senses of the words in the input question can be used to determine additional terms semantically related to the original keywords, which increases the number of possible relevant documents found by a search based on the input question.

A main problem with the current search engines is that many of the documents retrieved for general queries are totally irrelevant to the subject of interest, while relevant documents may be missing because the query does not contain the exact keywords. A more refined query, with more restrictive boolean operators, may result in few or even no documents.

This calls for a natural language interface that transforms sentences into queries with the boolean operators currently accepted by the search engines. One thing that can be highly beneficial is to search not only for the words that occur in the input sentence, but to create similarity lists with words from on-line dictionaries that have the same meaning as the input words. This can significantly broaden the web search.

Another area of improvement is to design new retrieval operators that retain more relevant information. A possible approach along this direction is still to use the existing search engines, which were developed with great effort, but to post-process the set of documents produced for a query.
Figure 1.1 shows the way in which an Internet search system can be improved by making use of NLP techniques. The figure presents a comparison between two architectures of Internet search systems: the one on the left side of the figure represents the basic scheme currently used by the search engines; the one on the right side of the figure describes an Internet search system in which NLP techniques are involved. The NLP stages and resources are represented as dark-background rectangles.

Given a natural language question, the improved system performs two tasks:

It determines the senses of the keywords in the input question. Using the information from WordNet, a similarity list is created for each keyword, which includes words semantically related to it. These lists are used for query extension, which is intended to increase the number of relevant documents found during a search.

[Figure 1.1 appears here: two side-by-side flowcharts. The basic Internet search system (left) turns an English sentence into a boolean query sent to a search engine, which returns ranked documents. The NLP-improved system (right) adds lexical processing (POS tagging with Brill's tagger, WSD, supported by WordNet), similarity list formation from the WordNet glosses for the keywords Xi, boolean query formation over the similarity lists Si, search with AltaVista, and lexical operators that extract relevant paragraphs/sentences from the ranked documents.]

Figure 1.1. A comparison between two Internet search system architectures: (left) the basic Internet search system currently used by search engines, and (right) an Internet search system improved with NLP techniques. The dark rectangles represent stages in the search system performed with NLP methods.

The extended query is then used as an input to the current search engines to retrieve documents from the Internet. These documents are further processed: new lexical operators are applied in order to fetch the relevant information. This step increases the precision of the search.

The thesis is organized as follows: Chapter 2 presents the resources used by our Word Sense Disambiguation method, as well as the resources involved in the Internet search system. The two kinds of resources, namely lexical resources and information retrieval resources, are presented in sections 2.1 and 2.2.

Next, the Word Sense Disambiguation method is described in Chapter 3: section 3.1 describes our approach for WSD; sections 3.2 and 3.3 present the two algorithms involved in this method; an example is presented in section 3.4, and then the results obtained are summarized in section 3.5. Possible extensions of this method are discussed in section 3.6.

An application of this WSD method is to improve the quality of search on the Internet: Chapter 4 describes such a natural language interface for searching the Internet. We first describe the problems encountered while using the current search engines; a solution to these problems is to make use of natural language techniques: this improved search system is described in section 4.4. An example of the usage of this system and the results we obtained are presented in sections 4.5 and 4.6. Previous work in this field is summarized in section 4.2.

Finally, the conclusions regarding the work described in this thesis are presented in Chapter 5.

CHAPTER 2
BACKGROUND ON RESOURCES

Several resources have been used in developing and testing the system described in this thesis. We can categorize these into:

1. Lexical resources, including:
(a) WordNet, a Machine Readable Dictionary, including the glosses defined in this dictionary;
(b) SemCor, a semantically tagged corpus;
(c) Brill's part of speech tagger.

2. Information Retrieval resources:
(a) AltaVista, a search engine for the Internet;
(b) TREC, the Text Retrieval Conference, which provides researchers with a set of topics for the purpose of testing systems designed for information retrieval.
2.1. Lexical resources

The second step of the WSD method proposed here makes use of the information provided by the WordNet [13] semantic hierarchies and its glosses in order to distinguish between word senses. To test the accuracy achieved in disambiguating words, we used SemCor [32], a corpus in which words are tagged with their corresponding sense from WordNet.

Within the search system (as presented in chapter 4) we need to process a natural language question in order to identify its keywords and main concepts. To do that, we need to know the part of speech of the words in the input sentence; this part of speech tagging is done using a version of Brill's Part of Speech Tagger [6]. During the query expansion phase, we are also using WordNet.
2.1.1. WordNet

WordNet is a Machine Readable Dictionary developed at Princeton University by a group led by George Miller [30], [13]. It is used by our system for WSD and for the generation of similarity lists for query extension.

WordNet covers the vast majority of nouns, verbs, adjectives and adverbs from the English language. The words in WordNet are organized in synonym sets, called synsets. Each synset represents a concept. WordNet 1.6 has a large network of 129,504 words, organized in 98,548 synsets. Table 2.1 presents the number of nouns, verbs, adjectives and adverbs defined in WordNet.
Table 2.1. The number of words and concepts in WordNet 1.6

Part of speech      Words      Concepts
noun               94,473        66,024
verb               10,318        12,156
adjective          20,169        17,914
adverb              4,545         3,574
Total             129,504        98,548
There is a rich set of 391,885 relation links among words, between words and synsets, and between synsets [17]. The basic semantic relation between words encoded in WordNet is the synonymy relation. The synsets are related by antonymy, hypernymy/hyponymy (is-a) and meronymy/holonymy (part-whole) relations.
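To make these relations concrete, the following small sketch walks the synsets and hypernyms of a noun. It uses NLTK's WordNet interface as a present-day stand-in for the WordNet 1.6 library used in this thesis; the code is illustrative only and is not part of the original system.

    # Illustrative only: NLTK's WordNet interface standing in for WordNet 1.6.
    # Requires the WordNet data: import nltk; nltk.download("wordnet")
    from nltk.corpus import wordnet as wn

    # Print each sense (synset) of the noun "letter" with its synonyms
    # and its is-a parents (hypernym synsets).
    for synset in wn.synsets("letter", pos=wn.NOUN):
        print(synset.name(), synset.lemma_names())
        for hypernym in synset.hypernyms():
            print("  is-a:", hypernym.lemma_names())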



Figure 2.1 presents a snapshot from the WordNet semantic hierarchies.

[Figure 2.1 appears here: a fragment of the WordNet noun hierarchy rooted at (carnivore), with children (fissiped mammal, fissiped), (canine, canid), (feline, felid), (bear) and (procyonid); below them, nodes such as (wolf), (wild dog), (dog), (hyena, hyaena), (hunting dog), (dachshund, dachsie, badger dog), (working dog), (terrier), (watch dog, guard dog), (police dog), (brown bear, bruin, Ursus arctos), (Syrian bear...) and (grizzly...).]

Figure 2.1. A WordNet hierarchy


2.1.1.1. The glosses in WordNet

Almost all the synsets in WordNet have defining glosses. A gloss consists of a definition, comments and examples. For example, the gloss of the synset {interest, interestingness} is (the power of attracting or holding one's interest (because it is unusual or exciting etc.); "they said nothing of great interest"; "primary colors can add interest to a room"). It has a definition, the power of attracting or holding one's interest; a comment, because it is unusual or exciting etc.; and two examples: they said nothing of great interest and primary colors can add interest to a room. Some glosses can contain multiple definitions or multiple comments.
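As a rough illustration of this structure, the sketch below splits a gloss string into its parts, treating parenthesized fragments as comments and quoted fragments as examples; this parsing heuristic is our own assumption for illustration, not part of the WordNet specification.

    import re

    # Heuristic gloss splitter (our own illustration, not WordNet code):
    # parenthesized fragments become comments, quoted fragments become
    # examples, and whatever remains is the definition.
    def split_gloss(gloss):
        comments = re.findall(r"\(([^()]*)\)", gloss)
        rest = re.sub(r"\([^()]*\)", "", gloss)
        examples = re.findall(r'"([^"]*)"', rest)
        definition = re.sub(r'"[^"]*"', "", rest).replace(";", " ")
        return " ".join(definition.split()), comments, examples

    gloss = ("the power of attracting or holding one's interest "
             "(because it is unusual or exciting etc.); "
             '"they said nothing of great interest"; '
             '"primary colors can add interest to a room"')
    print(split_gloss(gloss))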
2.1.2. SemCor

SemCor [32] is a textual corpus in which every word is linked to its appropriate sense in WordNet. Thus it can be viewed either as a corpus in which words have been tagged syntactically and semantically, or as a lexicon in which example sentences can be found for many definitions. The texts used to create the semantic concordances are extracted from the Brown Corpus and then linked to senses in the WordNet lexicon. The semantic tagging was done by hand, using various tools to annotate the text with WordNet senses.

The semantically tagged files are grouped into three semantic concordances based on what was tagged and when. Each semantic concordance is stored in a separate directory, as shown in Table 2.2.
Table 2.2. The semantic concordances in SemCor 1.6

Name        Content                     What's tagged
"brown1"    103 Brown corpus files      all open classes
"brown2"     83 Brown corpus files      all open classes
"brownv"    166 Brown corpus files      verbs

The semantically tagged data is codified using SGML. The SGML mark-up language has pairs of the form attribute=value to specify: part of speech, word sense, paragraphs, sentences, etc. Figure 2.2 presents a fragment from a SemCor file.

<contextfile concordance = brown>


<context filename=br-a01 paras=yes>
<p pnum=1>
<s snum=1>
<wf cmd=ignore pos=DT>The</wf>
<wf cmd=done rdf=group pos=NNP lemma=group wnsn=1 lexsn=1:03:00:: pn=group>Fulton_County_Grand_Jury</wf>
<wf cmd=done pos=VB lemma=say wnsn=1 lexsn=2:32:00::>said</wf>

Figure 2.2. A fragment from SemCor


Consider as an example the entry for the word said, as presented in this figure. It specifies the part of speech for this word as being VB (verb), the base form as being say, and its sense, based on the WordNet dictionary, as being sense #1.
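A minimal way to read such an entry programmatically is sketched below; the regular expression and function name are ours, not part of the SemCor distribution's tooling.

    import re

    # Parse one SemCor <wf ...>word</wf> entry into its attribute=value
    # pairs and the surface word (our own illustration).
    def parse_wf(line):
        match = re.match(r"<wf\s+([^>]*)>([^<]*)</wf>", line)
        if match is None:
            return None
        attrs = dict(pair.split("=", 1) for pair in match.group(1).split())
        return attrs, match.group(2)

    line = "<wf cmd=done pos=VB lemma=say wnsn=1 lexsn=2:32:00::>said</wf>"
    attrs, word = parse_wf(line)
    print(word, attrs["lemma"], attrs["wnsn"])   # prints: said say 1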

2.1.3. Brill's part of speech tagger

Part of speech tagging is an important area of NLP; such taggers are often used in the pre-processing phase of many lexical processing systems. The role of these taggers is to assign parts of speech to words; as this tagging is usually one of the first steps in lexical processing, it should be done with high accuracy, in order to reduce the propagated error.

Brill's Part of Speech Tagger [6] has been developed at the University of Pennsylvania; it is a rule-based system which automatically acquires rules from already tagged text, and then applies them in the process of assigning parts of speech to words encountered in free text.

When we decided to use this tagger in our system, we took into consideration its accuracy, resulting from tests performed by us against hand-tagged texts. We considered 58 files from the Wall Street Journal collection, which have been manually tagged within the Penn-Treebank project [25]. The average size of these files was 323.1 words and 368.09 tokens, respectively, where the set of tokens includes the words and the punctuation. The total size was 18,738 words and 21,349 tokens.

Out of these 18,738 words, the tags assigned by Brill's Part of Speech tagger were the same as those manually assigned in 17,272 cases, thus a precision of 92.18%. Out of the 21,349 tokens, 19,850 were tagged the same by Brill's tagger, i.e. 92.98% accuracy. This precision proves this tool to be suitable for performing part of speech tagging with high accuracy.
2.2. Information Retrieval resources

The purpose of the first algorithm of the WSD method proposed in this thesis is to gather statistics from the Internet regarding word-word co-occurrences. This task is performed using AltaVista to search the Internet. The same search engine is used by our search system to fetch documents which potentially include information relevant to the input question. We then tested our retrieval system using 50 questions derived from the topics provided at the 6th Text Retrieval Conference (TREC-6).
2.2.1. AltaVista

AltaVista [3] is a search engine developed in 1995 by the Digital Equipment Corporation in its Palo Alto research labs. There are several characteristics of this search service that make AltaVista one of the most powerful search engines. In choosing AltaVista for use in our system, we based our decision on two of its features: (1) the size of the information on the Internet that can be accessed through AltaVista: it has a growing index of over 160,000,000 unique World Wide Web pages; (2) it accepts complex boolean searches through its advanced search function. These features make this search engine suitable for the development of software around it, with the goal of increasing the quality of the information retrieved.
Specific relationships can be created among the keywords of a query accepted by AltaVista. These relations can be created using brackets and the AND, OR, NOT and NEAR operators.

AND finds only documents containing all of the specified words or phrases. Mary AND lamb finds documents with both the word Mary and the word lamb.

OR finds documents containing at least one of the specified words or phrases. Mary OR lamb finds documents containing either Mary or lamb. The documents retrieved may contain both words, but not necessarily.

NEAR finds documents containing both specified words or phrases within 10 words of each other. Mary NEAR lamb finds documents containing both the word Mary and the word lamb, with the restriction that these words are separated by at most 10 other words. [1]

[1] These examples are from the AltaVista Advanced Help: http://www.altavista.digital.com/av/content/help_advanced.htm
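As an illustration of how such boolean queries can be assembled programmatically, the sketch below builds the bracketed AND/OR/NEAR forms described above; the helper names are ours, only the operator syntax comes from AltaVista.

    # Build AltaVista-style advanced queries (helper names are ours; only
    # the AND/OR/NEAR operators and the bracketing come from the text).
    def near(*words):
        return "(" + " NEAR ".join(words) + ")"

    def any_of(*terms):
        return "(" + " OR ".join(terms) + ")"

    def all_of(*terms):
        return " AND ".join(terms)

    # The boolean form of "Who were the US Presidents of the last century?"
    # used as an example in Chapter 4:
    print(all_of(near("US", "Presidents"), near("last", "century")))
    # prints: (US NEAR Presidents) AND (last NEAR century)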
Our main concern when we decided to rely on AltaVista for searching documents on the Internet regarded the reliability of this search engine. The number of hits obtained for a given query should vary only within a small range for searches performed at various intervals of time. For the purpose of testing the reliability of AltaVista, we considered a set of 1,100 words (nouns, verbs, adjectives and adverbs); the set was built from one of the texts in the Brown corpus. A test run consisted of searching the Internet using AltaVista for each of these words, and recording the number of hits obtained. We performed 20 tests, over a period of 10 days, with a test run every 12 hours. The overall results for these tests showed that, with AV denoting the average number of hits for a particular word:

90% of the time, the hits were in the range [0.99 x AV, 1.01 x AV]
100% of the time, the hits were in the range [0.85 x AV, 1.15 x AV]

Taking into consideration the size of the information found on the Internet and the fact that this information is highly unstructured, the small variations achieved by AltaVista in searching the Internet classify this search engine as a reliable one.
2.2.2. TREC topics

The Text Retrieval Conferences (TREC) are part of the TIPSTER Program, and are intended to encourage research in information retrieval from large texts. The information needs are described by data structures called topics.

The TIPSTER project distinguishes between two different types of queries: ad hoc and routing. The ad hoc queries are designed to investigate the performance of systems that search a set of documents using novel topics; these are most suitable for systems performing specific searches. The routing queries investigate the performance of systems that search new streams of documents; the systems using this task usually address general searches; a routing query can be viewed as a filter on incoming documents.

As our search system is designed to improve the quality of the information retrieved, especially in the case of specific questions, we used the ad hoc topics in order to test the performance of our system. We derived 50 natural language questions from the ad hoc topics provided at the 6th Text Retrieval Conference [43].

An example of a topic from the TREC-6 ad hoc collection is presented in Figure 2.3. As can be seen from this figure, a topic is a frame-like data structure. Its fields are to be interpreted as follows: the <num> section identifies the topic; the <title> section classifies the topic within a domain; the <desc> section gives a brief description of the topic (for TREC-6, this section was intended to be an initial search query); the <narr> section provides a further explanation of what relevant material may look like.

<num> Number: 301
<title> International Organized Crime
<desc> Description:
Identify organizations that participate in international criminal activity, the activity, and, if possible, collaborating organizations and the countries involved.
<narr> Narrative:
A relevant document must as a minimum identify the organization and the type of illegal activity (e.g., Colombian cartel exporting cocaine). Vague references to international drug trade without identification of the organization(s) involved would not be relevant.

Figure 2.3. A TIPSTER topic


For the purpose of testing our system, we used the <desc> field to derive natural language questions in a form similar to the questions normally used by users to search the Internet. For example, from the corpus entry presented above, the question that we derived is: "Which are some of the organizations participating in international criminal activity?".

After retrieving the information using the derived questions, the relevance of the information has been evaluated based on the narrative section of each topic.


CHAPTER 3
WORD SENSE DISAMBIGUATION

3.1. A word-word dependency approach

The method presented here takes advantage of the sentence context. The words are paired and an attempt is made to disambiguate one word within the context of the other word. This is done by searching the Internet with queries formed using different senses of one word, while keeping the other word fixed. The senses are ranked simply by the order provided by the number of hits. A good accuracy is obtained, perhaps because the number of texts on the Internet is so large. In this way, all the words are processed and the senses are ranked. We use the ranking of senses to curb the computational complexity in the step that follows: only the most promising senses are kept.

The next step is to refine the ordering of senses by using a completely different method, namely the semantic density. This is measured by the number of common words that are within a semantic distance of two or more words. The closer the semantic relationship between two words, the higher the semantic density between them. We introduce the semantic density because it is relatively easy to measure it on a MRD like WordNet. A metric is introduced in this sense which, when applied to all possible combinations of the senses of two or more words, ranks them.

An essential aspect of the WSD method presented here is that it provides a ranking of possible associations between words, instead of a binary yes/no decision for each possible sense combination. This allows for a controllable precision, as other modules may be able to distinguish later the correct sense association from such a small pool [27], [29].

3.2. Contextual ranking of word senses


Since the Internet contains the largest collection of electronically stored texts, we use the Internet as a source of corpora for ranking the senses of words.
3.2.1. Algorithm 1

Input: a semantically untagged word1 - word2 pair (W1, W2)
Output: a ranking of the senses of one word

Procedure:

1. Form a similarity list for each sense of one of the words. Pick one of the words, say W2, and using WordNet, form a similarity list for each sense of that word. For this, use the words from the synset of each sense and the words from the hypernym synsets. Consider, for example, that W2 has m senses; then W2 appears in m similarity lists:

   (W2^1, W2^1(1), W2^1(2), ..., W2^1(k1))
   (W2^2, W2^2(1), W2^2(2), ..., W2^2(k2))
   ...
   (W2^m, W2^m(1), W2^m(2), ..., W2^m(km))

where W2^1, W2^2, ..., W2^m are the senses of W2, and W2^i(s) represents synonym number s of the sense W2^i as defined in WordNet.

2. Form W1 - W2^i(s) pairs. The pairs that may be formed are:

   (W1 W2^1, W1 W2^1(1), W1 W2^1(2), ..., W1 W2^1(k1))
   (W1 W2^2, W1 W2^2(1), W1 W2^2(2), ..., W1 W2^2(k2))
   ...
   (W1 W2^m, W1 W2^m(1), W1 W2^m(2), ..., W1 W2^m(km))

3. Search the Internet and rank the senses W2^i(s). A search performed on the Internet for each set of pairs as defined above results in a value indicating the frequency of occurrences for W1 together with the given sense of W2. In our experiments we used AltaVista [3], since it is one of the most powerful search engines currently available. Using the operators provided by AltaVista, query-forms are defined for each W1 - W2^i(s) set above:

   (a) ("W1* W2^i*" OR "W1* W2^i(1)*" OR "W1* W2^i(2)*" OR ... OR "W1* W2^i(ki)*")
   (b) ((W1* NEAR W2^i*) OR (W1* NEAR W2^i(1)*) OR (W1* NEAR W2^i(2)*) OR ... OR (W1* NEAR W2^i(ki)*))

for all 1 <= i <= m. The asterisk (*) is used as a wild-card to increase the number of hits with morphologically related words. Using one of these queries, we get the number of hits for each sense i of the noun, and this provides a ranking of the m senses of W2 as they relate with W1.

A similar algorithm is used to rank the senses of W1 while keeping W2 constant (not disambiguated). Since these two procedures are done over a large corpus (the Internet), and with the help of similarity lists, there is little correlation between the results produced by the two procedures.
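As an illustration of Algorithm 1, the sketch below ranks the senses of W2 in the context of W1 by hit counts. The hit_count() function is a hypothetical placeholder for an AltaVista-style hit count (the queries follow form (a) above), and NLTK's WordNet interface stands in for WordNet 1.6; this is our own sketch, not the thesis implementation.

    from nltk.corpus import wordnet as wn

    def hit_count(query):
        """Hypothetical placeholder: number of documents matching the query."""
        raise NotImplementedError

    def rank_senses(w1, w2, pos=wn.NOUN):
        """Rank the senses of w2 in the context of w1 by web hit counts."""
        ranking = []
        for sense in wn.synsets(w2, pos=pos):
            # Similarity list: synonyms of this sense plus hypernym words.
            words = set(sense.lemma_names())
            for hypernym in sense.hypernyms():
                words.update(hypernym.lemma_names())
            # Query form (a): ("w1* s1*" OR "w1* s2*" OR ...)
            query = " OR ".join('"%s* %s*"' % (w1, w.replace("_", " "))
                                for w in sorted(words))
            ranking.append((sense, hit_count(query)))
        # Highest hit count first.
        ranking.sort(key=lambda pair: pair[1], reverse=True)
        return ranking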
3.2.2. Procedure Evaluation

This method was tested on 384 pairs: 200 verb-noun (files br-a01, br-a02), 127 adjective-noun (file br-a01), and 57 adverb-verb (file br-a01), extracted from SemCor 1.6 of the Brown corpus. Using query form (a) on AltaVista, we obtained the results shown in Table 3.1. The table indicates the percentages of correct senses (as given by SemCor) ranked by us in the top 1, top 2, top 3, and top 4 of our list.

We concluded that by keeping the top four choices for verbs and nouns and the top two choices for adjectives and adverbs, we cover with high percentage (mid and upper 90's) all relevant senses. Looking from a different point of view, the meaning of the procedure so far is that it excludes the senses that do not apply, and this can save a considerable amount of computation time, as many words are highly polysemous.

We also used the query form (b), but the results obtained were similar; using the NEAR operator, a larger number of hits is reported, but the sense ranking remains more or less the same.
Table 3.1. Statistics gathered from the Internet for 384 word pairs

             top 1    top 2   top 3   top 4
noun         76%      83%     86%     98%
verb         60%      68%     86%     87%
adjective    79.8%    93%     -       -
adverb       87%      97%     -       -
3.3. Conceptual density algorithm

A measure of the relatedness between words can be a knowledge source for several decisions in NLP applications. The approach we take here is to construct a linguistic context for each sense of the verb and noun, and to measure the number of common nouns shared by the verb and the noun contexts. In WordNet each concept has a gloss that acts as a micro-context for that concept. This is a rich source of linguistic information that we found useful in determining the conceptual density between words.
3.3.1. Algorithm 2

Input: a semantically untagged verb - noun pair and a ranking of noun senses (as determined by Algorithm 1)
Output: a sense tagged verb - noun pair

Procedure:

1. Given a verb-noun pair V - N, denote with <v1, v2, ..., vh> and <n1, n2, ..., nl> the possible senses of the verb and the noun, using WordNet.

2. Using Algorithm 1, the senses of the noun are ranked. Only the first t possible senses indicated by this ranking will be considered. The rest are dropped to reduce the computational complexity.

3. For each possible pair vi - nj, the conceptual density is computed as follows:

(a) Extract all the glosses from the sub-hierarchy including vi (the rationale for selecting the sub-hierarchy is explained below).

(b) Determine the nouns from these glosses. These constitute the noun-context of the verb. Each such noun is stored together with a weight w that indicates the level in the sub-hierarchy of the verb concept in whose gloss the noun was found.

(c) Determine the nouns from the noun sub-hierarchy including nj.

(d) Determine the conceptual density Cij of common concepts between the nouns obtained at (b) and the nouns obtained at (c) using the metric:

    C_{ij} = \frac{\sum_{k=1}^{|d_{ij}|} w_k}{\log(descendants_j)}        (1)

where:
- |d_ij| is the number of common concepts between the hierarchies of vi and nj;
- w_k are the levels of the nouns in the hierarchy of verb vi;
- descendants_j is the total number of words within the hierarchy of noun nj.

4. Cij ranks each pair vi - nj, for all i and j.
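A minimal sketch of the metric in step 3(d) is given below, assuming the caller has already collected the weighted noun-context of the verb sense (step (b)) and the nouns and size of the noun sub-hierarchy (step (c)); the function and argument names are ours, not the thesis code.

    import math

    def conceptual_density(verb_gloss_nouns, noun_hierarchy, descendants):
        """Compute C_ij for one verb sense / noun sense pair.

        verb_gloss_nouns: dict noun -> weight w (the level in the verb
            sub-hierarchy where the noun was found in a gloss).
        noun_hierarchy: set of nouns in the sub-hierarchy of sense nj.
        descendants: total number of words in that noun sub-hierarchy.
        """
        common = set(verb_gloss_nouns) & noun_hierarchy
        if not common or descendants <= 1:
            return 0.0
        # C_ij = (sum of weights of common concepts) / log(descendants_j)
        return sum(verb_gloss_nouns[n] for n in common) / math.log(descendants)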


3.3.2. Rationale

1. In WordNet, a gloss explains a concept and provides one or more examples with typical usage of that concept. In order to determine the most appropriate noun and verb hierarchies, we performed some experiments using SemCor and concluded that the noun sub-hierarchy should include all the nouns in the class of nj. The sub-hierarchy of verb vi is taken as the hierarchy of the highest hypernym hi of the verb vi. It is necessary to consider a larger hierarchy than just the one provided by synonyms and direct hyponyms. As we replaced the role of a corpus with glosses, better results are achieved if more glosses are considered. Still, we do not want to enlarge the context too much.

2. As nouns with a big hierarchy tend to have a larger value for |d_ij|, the weighted sum of common concepts is normalized with respect to the dimension of the noun hierarchy. Since the size of a hierarchy grows exponentially with its depth, we used the logarithm of the total number of descendants in the hierarchy, i.e. log(descendants_j).

3. We also took into consideration and experimented with a few other metrics. But after running the program on several examples, the formula from Algorithm 2 provided the best results.
3.4. An example

As an example, let us consider the verb-noun collocation revise law. The verb revise has two possible senses in WordNet 1.6 and the noun law has seven senses.

First, Algorithm 1 was applied, searching the Internet with AltaVista for all possible pairs V-N that may be created using revise and the words from the similarity lists of law. The following ranking of senses was obtained: law#2(2829), law#3(648), law#4(640), law#6(397), law#1(224), law#5(37), law#7(0), where the numbers in parentheses indicate the number of hits. By setting the threshold at t = 2, we keep only senses #2 and #3.

Next, Algorithm 2 is applied to rank the four possible combinations (two for the verb times two for the noun). The results are summarized in Table 3.2: (1) |d_ij| - the number of common concepts between the verb and noun hierarchies; (2) descendants_j - the total number of nouns within the hierarchy of each sense nj; and (3) the conceptual density Cij for each pair vi - nj, derived using the formula presented above.

The largest conceptual density, C12 = 0.30, corresponds to v1 - n2: revise#1/2 - law#2/5 (the notation #i/n means sense i out of n possible senses given by WordNet). This combination of verb-noun senses also appears in SemCor, file br-a01.

Table 3.2. Values used in computing the conceptual density, and the conceptual density Cij

        |d_ij|         descendants_j       C_ij
        n2     n3      n2       n3         n2      n3
v1      5      4       975      1265       0.30    0.28
v2      0      0       975      1265       0       0

3.5. Evaluation and comparison with other methods

3.5.1. Tests against SemCor

The method was tested on 384 pairs selected from the first two tagged files of SemCor 1.6 (files br-a01, br-a02). From these, there are 200 verb-noun pairs, 127 adjective-noun pairs and 57 adverb-verb pairs. Table 3.3 presents a summary of the results.

Table 3.3. Final results obtained for 384 word pairs using both algorithms

             top 1    top 2   top 3   top 4
noun         86.5%    96%     97%     98%
verb         67%      79%     86%     87%
adjective    79.8%    93%     -       -
adverb       87%      97%     -       -


Discussion of results

When evaluating these results, one should take into consideration that:

1. Using the glosses as a base for calculating the conceptual density has the advantage of eliminating the use of a large corpus. But a disadvantage that comes from the use of glosses is that they are not part-of-speech tagged, like some corpora are (i.e. Treebank). For this reason, when determining the nouns from the verb glosses, an error rate is introduced, as some verbs (like make, have, go, do) are lexically ambiguous, having a noun representation in WordNet as well. We believe that future work on part-of-speech tagging the glosses of WordNet will improve our results.

2. The determination of senses in SemCor was done, of course, within a larger context: the context of sentence and discourse. By working only with a pair of words we do not take advantage of such a broader context. For example, when disambiguating the pair protect court, our method picked the court meaning "a room in which a law court sits", which seems reasonable given only two words, whereas SemCor gives the court meaning "an assembly to conduct judicial business", which results from the sentence context (this was our second choice). In the next section we extend our method to more than two words disambiguated at the same time.
3.5.2. Comparison with other methods

As indicated in [38], it is difficult to compare WSD methods, as long as distinctions reside in the approach considered (MRD-based methods, supervised or unsupervised statistical methods), and in the words that are disambiguated. A method that disambiguates unrestricted nouns, verbs, adverbs and adjectives in texts is presented in [42]; it attempts to exploit sentential and discourse contexts and is based on the idea of semantic distance between words, and lexical relations. It uses WordNet and it was tested on SemCor. They report an average accuracy of 85.7% for nouns, 63.9% for verbs, 83.6% for adjectives and 86.5% for adverbs, slightly less than our results.

Moreover, for applications such as information retrieval we can use more than one sense combination; if we take the top 2 ranked combinations, our average accuracy is 91.5% (from Table 3.3).

Other methods that were reported in the literature disambiguate either words of one part of speech (i.e. nouns), or, in the case of purely statistical methods, focus on a very limited number of words. Some of the best results were reported by Yarowsky in [45], who uses a large training corpus. For the noun drug, Yarowsky obtains 91.4% correct performance and, when considering the restriction "one sense per discourse", the accuracy increases to 93.9%.
3.6. Extensions

3.6.1. Noun-noun and verb-verb pairs

The method presented here can be applied in a similar way to determine the conceptual density within noun-noun pairs or verb-verb pairs. In these cases, the first algorithm has to be modified: the NEAR operator has to be used to form word pairs for searching the Internet.
3.6.2. Larger window size

We have extended the disambiguation method to more than two word co-occurrences. Consider for example:

The bombs caused damage but no injuries.

The senses specified in SemCor are:

1a. bomb(#1/3) cause(#1/2) damage(#1/5) injury(#1/4)

For each word X, we considered all possible combinations with the other words Y from the sentence, two at a time. The conceptual density was computed for the combinations X - Y as a summation of the conceptual densities between the sense i of the word X and all the senses of the words Y, as written out in the formula below.
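In formula form (our reconstruction of the description above; the notation is ours, not the thesis'), the score for sense i of a word X, given the other words Y of the sentence, is

    C(X^i) = \sum_{Y \neq X} \sum_{j=1}^{m_Y} C(X^i, Y^j)

where m_Y is the number of senses of Y and C(X^i, Y^j) is the pairwise conceptual density computed as in Algorithm 2.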


The results are shown in Tables 3.4, 3.5, 3.6 and 3.7 below, where the conceptual density calculated for the sense #i of word X is presented in the column denoted by C#i:
Table 3.4. CD for bomb- pairs

                 C#1     C#2     C#3
bomb-cause       0.57    0       0
bomb-damage      5.09    0.13    0
bomb-injury      2.69    0.15    0
Sum              8.35    0.28    0

Table 3.5. CD for cause- pairs

                 C#1     C#2
cause-bomb       5.16    1.34
cause-damage     12.83   2.64
cause-injury     12.63   1.75
Sum              30.62   5.73

By selecting the largest values for the conceptual density, the words are tagged with their senses as follows:

1b. bomb(#1/3) cause(#1/2) damage(#1/5) injury(#2/4)

Note that the senses for the word injury differ from 1a to 1b; the one determined by our method (#2/4) is described in WordNet as "an accident that results in physical damage or hurt" (hypernym: accident), and the sense provided in SemCor (#1/4) is defined as "any physical damage" (hypernym: health problem).

This is a typical example of a mismatch caused by the fine granularity of senses in WordNet, which translates into a human judgment that is not clear cut. We think that the sense selection provided by our method is justified, as both damage and injury are objects of the same verb cause; the relatedness of damage(#1/5) and injury(#2/4) is larger, as both are of the same class, noun.event, as opposed to injury(#1/4), which is of class noun.state.
Some other randomly selected examples considered were:

2a. The terrorists(#1/1) bombed(#1/3) the embassies(#1/1).
2b. terrorist(#1/1) bomb(#1/3) embassy(#1/1)
3a. A car-bomb(#1/1) exploded(#2/10) in front of PRC(#1/1) embassy(#1/1).
3b. car-bomb(#1/1) explode(#2/10) PRC(#1/1) embassy(#1/1)
4a. The bombs(#1/3) broke(#23/27) windows(#1/4) and destroyed(#2/4) the two vehicles(#1/2).
4b. bomb(#1/3) break(#3/27) window(#1/4) destroy(#2/4) vehicle(#1/2)
Table 3.6. CD for damage- pairs

                  C#1     C#2    C#3    C#4    C#5
damage-bomb       5.60    2.14   1.95   0.88   2.16
damage-cause      1.73    2.63   0.17   0.16   3.80
damage-injury     9.87    2.57   3.24   1.56   7.59
Sum               17.20   7.34   5.36   2.60   13.55

Table 3.7. CD for injury- pairs

                  C#1    C#2     C#3    C#4
injury-bomb       2.35   5.35    0.41   2.28
injury-cause      0      4.48    0.05   0.01
injury-damage     5.05   10.40   0.81   9.69
Sum               7.40   20.23   1.27   11.98

where sentences 2a, 3a and 4a are extracted from SemCor, with the associated senses for each word, and sentences 2b, 3b and 4b show the verbs and the nouns tagged with their senses by our method. The only discrepancy is for the word broke, and perhaps this is due to the large number of its senses. The other word with a large number of senses, explode, was tagged correctly, which was encouraging.

Considering only pairs of two words, the words in the 4 sentences have been tagged as follows:

1c. bomb(#1/3) cause(#1/2) damage(#5/5) injury(#2/4)
2c. terrorist(#1/1) bomb(#1/3) embassy(#1/1)
3c. car-bomb(#1/1) explode(#1/10) PRC(#1/1) embassy(#1/1)
4c. bomb(#1/3) break(#1/27) window(#1/4) destroy(#4/4) vehicle(#1/2)

Out of these 16 words, 14 have been correctly tagged when the larger window size was considered, and only 12 words were correctly disambiguated when pairs of 2 words were considered. Thus, for this case, the increase in accuracy was 16% when multiple words are considered in the disambiguation process.
3.7. Conclusion

Word Sense Disambiguation is one of the most difficult tasks in NLP; even more difficulties arise when senses are identified using a fine-grained Machine Readable Dictionary like WordNet. The method proposed here provides a ranking of senses, rather than a single correct sense. It combines two algorithms: first, statistics are gathered from the Internet to indicate the possible combinations of words; then, a semantic density measure is used to calculate the relatedness between words.

In the next chapter, we describe a possible application of the WSD method presented here. As we will show, the process of searching the Internet can be done with better results when using Natural Language Processing techniques. In order to get relevant information from the Internet related to a given input question, we first have to pre-process the question; this pre-processing phase includes a WSD step, in which a sense is assigned to each word; this information will be necessary during the query expansion phase.


CHAPTER 4
A NATURAL LANGUAGE INTERFACE FOR
SEARCHING THE INTERNET

4.1. Searching the Internet: NLP solutions for its drawbacks

Some of the problems of the current search engines have already been pointed out in Chapter 1. These include the fact that many of the documents retrieved for general queries are totally irrelevant to the subject of interest, and that relevant documents may be missing because the query does not contain the exact keywords. On the other hand, a more refined query, with more restrictive boolean operators, may result in few or even no documents.

Another problem is that typically the users are non-computer professionals; they tend to use natural language questions, which are far more complex than the simple words or combinations of words accepted by the search engines. For example, for finding the presidents of the United States during the last century, a common user will ask "Who were the US Presidents of the last century?", rather than using a query formed with boolean operators, such as (US NEAR Presidents) AND (last NEAR century).

In this thesis we examine some of the benefits of using Natural Language Processing in conjunction with WordNet, to improve the quality of the results. Specifically, the thesis describes a system that addresses two issues: (1) the translation of a natural language question or sentence into a query, and query extension using WordNet, and (2) extraction of paragraphs that render relevant information from the documents fetched by the search engines.

The performance of a search engine is measured based on both the relevance of the information retrieved and the number of relevant documents retrieved. The evaluation methodology can be based on two factors: the precision and the resolution. Usually, precision is defined as the ratio between the number of relevant documents retrieved and the total number of documents retrieved. In the case of searching the Internet, this parameter is hard to compute, as it is very difficult to check all the documents returned by a search in order to determine the relevant ones. Generally, a user does not check more than the top 10 documents returned by a search engine; for this reason, we redefined this factor such that it measures the number of relevant documents found in the top 10 hits. In this thesis, from now on, the term precision will refer to this new metric, while the term full-precision will refer to the old meaning of precision, i.e. the number of relevant documents retrieved over the total number of documents retrieved. Another important fact which has to be determined is whether or not a user found an answer to her question. For this purpose, we defined a new parameter, called resolution, which represents the number of answered questions from a set of posed questions.

The first step of our method, namely the query extension phase, is intended to increase the number of possibly relevant documents retrieved by a search, while the second step, fetching relevant information, increases the precision. Both these steps are intended to increase the resolution [33].
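As a small illustration of the two measures as redefined above, the sketch below computes them from boolean relevance and answer judgments; the data structures and names are ours.

    # Our own illustration of the redefined measures: "precision" counts
    # relevant documents among the top 10 hits; "resolution" counts the
    # questions for which an answer was found.
    def precision_at_10(relevance):
        """relevance: booleans, one per retrieved document, in rank order."""
        return sum(1 for r in relevance[:10] if r)

    def resolution(answered):
        """answered: booleans, one per posed question."""
        return sum(1 for a in answered if a)

    print(precision_at_10([True, False, True, True, False, False,
                           True, False, False, False, True]))   # prints 4
    print(resolution([True, True, False]))                      # prints 2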
4.2. Previous work

Two main approaches have been considered by researchers in trying to improve the quality of search on the Internet or large collections of texts.

The first one is to make use of multiple search engines and create a meta search engine [41], [16]. This will result in an increased number of documents, as they are retrieved based on the information stored in multiple search engine databases. The hard task in this approach is that different search engines are largely incompatible and do not always allow for interoperability. Solving this problem implies a unification of both the query language and the type of results returned by the different search engines.

The second approach is to use NLP techniques. Here, work has been developed in two directions: (1) the use of query extension techniques to increase the number of documents retrieved; (2) improving the quality of the information retrieved using NLP-based systems: REASON [4], INQUIRY [11].

Query expansion has been proved to have positive effects in retrieving relevant information [24]. The purpose of query extension can be either to broaden the set of documents retrieved or to increase the retrieval precision. In the former case, the query is expanded with terms similar to the words from the original query, while in the second case the expansion procedure adds completely new terms.

There are two main techniques used in expanding an original query. The first one considers the use of a Machine Readable Dictionary; [44] and [2] make use of WordNet to enlarge the query such that it includes words which are semantically related to the concepts from the original query. The basic semantic relation used in their systems is the synonymy relation; still, these techniques allow a further extension of the query, by using other semantic relations which can be derived from a MRD, like for example the hypernymy and hyponymy relations.

A second technique considered by researchers for query expansion is to use words derived from relevant documents. The SMART system [8], developed at Cornell University, does massive query expansion, adding from 300 to 530 terms to each query, terms which are derived from relevant documents. They report a precision improvement of 7% to 25% obtained during their experiments. Another method is proposed by Ishikawa in [20]; the original query is extended with words from paragraphs which are considered to be relevant, based on a similarity measure between the paragraphs and the original query. In [24], Lu evaluates the performance obtained during experiments performed with the TIPSTER collection. They prove that larger queries can increase the full-precision within a range from 0% to 20%.

4.3. Types of questions


The a tion of asking a question is based on someone's need to in rease his or
her knowledge about a parti ular domain. This pro ess of looking for information is
a omponent of a more general learning pro ess. The questions one might pose an
be lassi ed based on the information needed to formulate an answer, as well as on
the level of di ulty in nding su h an answer.
We present below a lassi ation of types of questions. It is a tually a transformation from Bloom's taxonomy [5, whi h was often used in learning studies; it
represents types of questions whi h are asked by people in the pro ess of looking for
information.
There are six lasses of questions, with level of di ulty in reasing from level 1
to level 6.
Class 1. Knowledge. The answers to these questions can be found by identifying information.
  Types: Who...?, What...?, When...?, Where...?, How...?
  Example: Who is the President of France?

Class 2. Comprehension. The formulation of an answer to this kind of question involves a selection of facts and ideas, and eventually a summarization of these facts.
  Types: Retell the story of...
  Example: Describe the events in Italy during the Second World War.

Class 3. Application. These questions are more difficult than the previous ones, in that the information obtained has to be applied in order to produce a result.
  Types: How is ... an example of ...?, How is ... related to ...?
  Example: Why are computers important?

Class 4. Analysis. This class of questions involves a subdivision of objects/facts, to show how they are put together. The answer to these questions implies a separation of the whole into its components.
  Types: Classify ... according to ..., What evidence can you list for ...?
  Example: What are the features of a mammal?

Class 5. Synthesis. This is the opposite of the previous type: here, ideas and facts have to be combined to create a new whole.
  Types: What would you predict from ...?, What solutions would you suggest for ...?
  Example: How would you design a new computer?

Class 6. Evaluation. This is the hardest type of question, as it involves making decisions about issues and developing opinions or judgments.
  Types: Do you agree ...?, What do you think about ...?
  Example: What is the most important commercial city in Austria?
Of these classes, the easiest, i.e. Class 1, is predominant: studies of the types of questions usually asked by people have shown that 80% of questions involve only the identification of information, rather than a deeper analysis of facts and ideas. The system described in this thesis addresses the task of finding an answer for this first class of questions. Finding answers to the other types of questions is much harder, and would involve text understanding, summarization and other techniques.
4.4. System architecture

The system architecture is shown in Figure 4.1. An input query or sentence expressed in English is first presented to the lexical processing module, which extracts the keywords from the sentence. The query formation module uses these keywords to form queries that are sent to one or more search engines. The documents fetched by the search engines are filtered with the help of some new lexical operators.
[Figure 4.1. System organization. The diagram shows the processing flow: an English sentence enters the lexical processing module (POS tagging with Brill's tagger and WSD, drawing on WordNet and its glosses), producing keywords x_i; similarity lists S_i are formed from WordNet and combined with boolean operators into a query; the query is sent to the search engines (AltaVista), and the ranked documents returned are filtered by the lexical operators to extract relevant paragraphs/sentences. Together, these lexical and information retrieval resources form the Internet search system improved with NLP techniques.]

4.4.1. Lexical processing

This module has been adopted from an information extraction system we developed for the MUC competition [34]. First, the sentence boundaries are located. The module then places part-of-speech tags on words, using a version of Brill's tagger [6] in conjunction with WordNet. It also contains a phrase parser that segments each sentence into constituent noun and verb phrases and recognizes the head words. After the elimination of the stop words (conjunctions, prepositions, pronouns and modal verbs), we are left with keywords x_i that represent the important concepts of the input sentence.

The WSD method presented in Chapter 3 is used to identify the correct sense of these keywords. This is an important step of the lexical processing phase, as knowing the sense of each word will later enable us to expand the query with words semantically related to the original keywords.
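A rough approximation of this keyword-extraction step is sketched below, assuming Python with NLTK's tokenizer and Penn Treebank tagger standing in for Brill's tagger; the STOP_TAGS set and the extract_keywords function are illustrative names of our own, and the sketch omits the phrase parser and head-word recognition.

    import nltk  # assumes the 'punkt' and 'averaged_perceptron_tagger' data are installed

    # Penn Treebank tags for the stop-word classes named above: conjunctions (CC),
    # prepositions (IN), pronouns (PRP, PRP$, WP, WP$) and modal verbs (MD)
    STOP_TAGS = {'CC', 'IN', 'PRP', 'PRP$', 'WP', 'WP$', 'MD'}

    def extract_keywords(sentence):
        """Return the content-word keywords x_i of a sentence as (word, tag) pairs."""
        tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
        return [(w, t) for w, t in tagged
                if t not in STOP_TAGS and w.isalpha()]

    print(extract_keywords("How much tax does an average salary person pay"
                           " in the United States?"))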
4.4.2. Query formulation

The two main functions performed by this module are: (1) the construction of similarity lists using WordNet, and (2) the actual query formation.

Once we have a sense ranking for each word in the input sentence, it is relatively easy to use the rich semantic information contained in WordNet to identify many other words that are semantically similar to a given input word. By doing this we increase the chance of finding more answers to input queries. WordNet can provide semantic similarity between concepts at various levels. Here are three levels that may be considered, in descending order of interest.

Level 1. Words semantically similar, i.e. words belonging to the same synset.

Level 2. Words that express concepts linked by semantic relations, i.e. hyponymy/hypernymy, meronymy/holonymy and entailment.

Level 3. Words that belong to sibling concepts, namely concepts that are subsumed by the same concept.
Consider, for example, sense number 1 of the noun operation. There are eight senses of this word in WordNet. The synset for the first sense includes only the word itself. The hypernym synset for this first sense includes three words: action, activity and activeness. The sibling synsets are {agency}, {play}, {busyness}, {behavior, behaviour}, {eruption, eructation} and {swing}.

The similarity lists that we can now create, corresponding to the three levels defined above, are:

L1. {operation}
L2. {operation, action, activity, activeness}
L3. {operation, action, activity, activeness, agency, play, busyness, behavior, behaviour, eruption, eructation, swing}
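A sketch of how these three lists can be derived programmatically is given below, assuming NLTK's WordNet interface; the similarity_lists helper is an illustrative name of our own, and since sense numbering in current WordNet releases differs from the WordNet 1.6 senses used in this thesis, the lists produced for operation will not match the example above exactly.

    from nltk.corpus import wordnet as wn  # assumes the NLTK WordNet corpus is installed

    def similarity_lists(word, sense=1, pos=wn.NOUN):
        # Level 1: the words of the chosen synset itself
        synset = wn.synsets(word, pos=pos)[sense - 1]
        words = lambda s: {l.name().replace('_', ' ') for l in s.lemmas()}
        level1 = words(synset)
        # Level 2: add concepts linked by direct semantic relations
        related = (synset.hypernyms() + synset.hyponyms() + synset.entailments()
                   + synset.member_holonyms() + synset.part_meronyms())
        level2 = level1.union(*map(words, related))
        # Level 3: add sibling concepts, i.e. co-hyponyms under the same hypernym
        siblings = [s for h in synset.hypernyms()
                    for s in h.hyponyms() if s != synset]
        level3 = level2.union(*map(words, siblings))
        return level1, level2, level3

    l1, l2, l3 = similarity_lists('operation', sense=1)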

Several experiments have been performed in order to measure the performan e


a hieved using di erent levels of word similarities. Con lusions drawn from experiments on small olle tions of texts [40 showed that expansion by synonyms improved
the performan e, but expansion by broader or narrower terms, sele ted from a hierar hi al thesaurus did not prove to be very useful.
In [44, Voorhees investigated the e a y of expanding a query for sear h in
large text olle tions. The work developed during this investigation uses WordNet and
experiments for four expanding strategies: expansion by synonyms only, expansion
by synonyms plus all des endents in a isa hierar hy, expansion by synonyms plus
parents and all des endents in a isa hierar hy, and expansion by synonyms plus any
synset dire tly related to the given synset. The results have shown that there are no
signi ant di eren es between the pre ision obtained while using the four expanding
strategies.
These experiments performed previously by resear hers, and several other tests
that we performed, drove us to the on lusion that no important improvement an be
a hieved by using broader or narrower terms. Thus, the results that we are presenting
in lude only the Level 1 similarity.
Let us denote by x_i the words of a question or sentence, and by W_i the similarity lists provided by WordNet for each word x_i. The elements of a list W_i are the words x_ij, where j enumerates the elements of the list, i.e. words on the same level of similarity with the word x_i. These lists can now be used for the actual query formulation, using the boolean operators accepted by the current search engines. The OR operator is used to link the words within a similarity list W_i, while the AND and NEAR operators link the similarity lists.

While different combinations of similarity lists linked by AND or NEAR operators are possible, there are two basic forms giving the maximum, respectively the minimum, number of documents retrieved:

(1) W1 AND W2 AND ... AND Wn
(2) W1 NEAR W2 NEAR ... NEAR Wn

In most cases, format (1) gathered thousands of documents, while format (2) had almost always null results.
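To make this step concrete, here is a minimal sketch of how such boolean queries could be assembled from the similarity lists; the build_query helper and the quoting of multi-word terms are our own assumptions rather than part of the original system, and the AND/NEAR/OR syntax follows the AltaVista-style advanced search described above.

    def build_query(similarity_lists, connective='AND'):
        """Form a boolean query: OR within each similarity list W_i,
        AND or NEAR between the lists."""
        clauses = []
        for wlist in similarity_lists:
            # quote multi-word terms such as "United States of America"
            terms = ['"%s"' % w if ' ' in w else w for w in wlist]
            clauses.append('(' + ' OR '.join(terms) + ')')
        return (' %s ' % connective).join(clauses)

    W1 = ['tax', 'taxation', 'revenue enhancement']
    W2 = ['average', 'intermediate', 'medium', 'middle']
    print(build_query([W1, W2], 'NEAR'))
    # (tax OR taxation OR "revenue enhancement") NEAR (average OR intermediate OR medium OR middle)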
The conclusion so far is that the documents containing the answers, if any, must be among the large number of documents provided by the AND operators. However, the search engines failed to rank them at the top of the list. Thus, we sought new operators that filter out many of the irrelevant texts.
4.4.3. New operators

Our approach to filtering documents is to first search the Internet using weak operators (AND, OR) and then to further search this large number of documents using more powerful operators. For this second phase, we propose the following additional operators:

PARAGRAPH n (... similarity lists ...)

The PARAGRAPH operator searches, like an AND operator, for the words in the similarity lists, with the constraint that the words all occur within n consecutive paragraphs. The rationale is that the information requested is most likely found in a few paragraphs, rather than being dispersed over an entire document. A similar idea can be found in [10].
SENTENCE n (... similarity lists ...)

The SENTENCE operator searches, like an AND operator, for the words in the similarity lists, with the constraint that the words belong to the same sentence. The answers to many queries are found in single, sometimes complex, sentences.
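As an illustration, a minimal sketch of how these two operators might be evaluated over text that has already been segmented is shown below; the paragraph_match and sentence_match helpers are illustrative names of our own, and simple lowercase substring containment stands in for proper token matching.

    def paragraph_match(paragraphs, similarity_lists, n=2):
        """PARAGRAPH n: return each window of n consecutive paragraphs that
        contains at least one word from every similarity list (an AND over
        the lists, an OR within each list)."""
        hits = []
        for i in range(max(len(paragraphs) - n + 1, 1)):
            window = ' '.join(paragraphs[i:i + n]).lower()
            if all(any(w.lower() in window for w in wlist)
                   for wlist in similarity_lists):
                hits.append(paragraphs[i:i + n])
        return hits

    def sentence_match(sentences, similarity_lists):
        """SENTENCE: the same test applied to single sentences."""
        return paragraph_match(sentences, similarity_lists, n=1)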
In order to apply these new operators, the documents gathered from the Internet have to be segmented into sentences and paragraphs, respectively. Separating a text into sentences proves to be an easy task; one could simply make use of the punctuation. Paragraph segmentation, on the other hand, is much more difficult, due first of all to the highly unstructured texts found on the Web. Work developed in this direction is presented in [18] and [10], but these methods work only for structured texts with a priori known lexical separators (i.e. a tag, an empty line, etc.). Thus, we had to use a method that covers almost all the possible paragraph separators that can occur in texts on the web. The paragraph separators that we considered so far are: (1) HTML tags; (2) empty lines; (3) paragraph indentation.
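A possible implementation of this segmentation heuristic, covering the three kinds of separators above, might look as follows; the regular expressions are our own approximation, not the exact patterns used by the system.

    import re

    def split_paragraphs(html_text):
        """Split raw web text on the three separators considered here:
        (1) paragraph-level HTML tags, (2) empty lines, (3) indented lines."""
        text = re.sub(r'(?i)</?(p|br|div|td|li|h[1-6])\b[^>]*>', '\n\n', html_text)  # (1)
        text = re.sub(r'<[^>]+>', ' ', text)               # drop remaining tags
        chunks = re.split(r'\n\s*\n|\n(?=[ \t])', text)    # (2) empty lines, (3) indentation
        return [' '.join(c.split()) for c in chunks if c.strip()]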
4.5. An example

The system operation is presented below with the help of an example. Suppose one wants to find the answer to the question: "How much tax does an average salary person pay in the United States?"

The linguistic processing module identified the following keywords:

x1 = (tax), pos = noun, sense #1/1
x2 = (average), pos = adjective, sense #4/5
x3 = (salary), pos = noun, sense #1/1
x4 = (the United States), pos = noun, sense #1/2
x5 = (person), pos = noun, sense #1/3
x6 = (pays), pos = verb, sense #1/7
In the notation above, "pos" means part of speech, and the sense number indicates the actual WordNet sense that resulted from the disambiguation, out of all the possible senses in WordNet. For instance, the adjective average has 5 senses and the system picked sense #4. Note that the senses of words in WordNet are ranked in the order of their frequency of use in a large corpus.

These keywords become the input for the next step of our system, except for those keywords positioned too high in the WordNet hierarchies. In our example, considering only Level 1 similarity and only the first four keywords, WordNet provides:

W1 = {tax, taxation, revenue enhancement}
W2 = {average, intermediate, medium, middle}
W3 = {salary, wage, pay, earnings, remuneration}
W4 = {United States, United States of America, America, US, U.S., USA, U.S.A.}
Table 4.1. Queries with various combinations of operators

     Query                              Number of documents
  1  W1 AND W2 AND W3 AND W4            15,464
  2  W1 AND (W2 NEAR W3) AND W4         3,217
  3  W1 NEAR (W2 NEAR W3) AND W4        803
  4  W1 NEAR W2 NEAR W3 NEAR W4         0
  5  W1 AND {average W3} AND W4         1,752
  6  W1 AND {average W3} NEAR W4        1 (not relevant)
  7  W1 NEAR {average W3} NEAR W4       0

These lists are used to formulate queries for the search engine. As we will see, the operators available today in search engines are not adequate to provide the desired answers in most cases. Table 4.1 shows some queries and the number of documents provided by AltaVista, considered to be one of the search engines with the most powerful set of operators available today.
The ranking provided by AltaVista is of no use for us here. None of the leading documents in any category provides the desired information. The only document fetched by Query 6 is equally irrelevant:

....Instead, their plans would shift more of the total tax burden on to labor, taxing capital once under a business tax, and taxing wages and salaries twice under both the income tax and the payroll tax. Middle-class Americans have to pay more under such a system, and wealthy people much less....

....The average taxpayer must work 86 days to pay all federal taxes, and must work 36 days just to pay his or her federal income tax. The average American must work 2 hours and 49 minutes every working day to pay all their taxes.....
An analysis of the results in the table above indicates that there is a gap in the volume of documents retrieved with the AltaVista operators. For instance, using only the AND operator (Query 1), 15,464 documents were obtained, but the NEAR operator (Query 4) produced no output. This operator seems to be too restrictive, and it fails to identify the right answer. Various combinations of AND and NEAR operators were tried, as indicated in the table above, with no great results.

Using the PARAGRAPH operator for the example above, the system found a relevant answer:

In 1910, American workers paid no income tax. In 1995, a worker earning an average wage of $26,000 pays about 24% (about $6,000) in income taxes. The average American worker's pay has risen greatly since 1910. Then, the average worker earned about $600 per year. Today, the figure is $26,000.
4.6. Results

In [22], Leong classifies queries as concrete queries and abstract queries. In this classification, concrete queries are defined as queries based on exemplars, while abstract queries are those based on descriptions. The TREC topics are associated with abstract queries, while users usually search using concrete queries.
For the purpose of testing our system, we considered 50 abstract questions derived from the descriptive section of each of the 50 topics in the TREC-6 set, and 50 concrete questions used by users to search the Internet; let us denote these sets as the TREC set and the REAL set, respectively. Table 4.2 presents five randomly selected questions from the TREC set and five questions from the REAL set, together with the results we obtained.
Table 4.2. A sample of the results obtained for randomly selected questions from the TREC and the REAL sets.

TREC questions:
1. Which are some of the organizations participating in international criminal activity?
   AND x_i: 27,716/0   NEAR x_i: 3/1   AND w_i: 48,133/0   NEAR w_i: 5/1   PARAGRAPH: 6/1   SENTENCE: 0/0
2. Is the disease of Poliomyelitis (polio) under control in the world?
   AND x_i: 9,432/1   NEAR x_i: 13/3   AND w_i: 10,271/2   NEAR w_i: 15/3   PARAGRAPH: 40/11   SENTENCE: 3/2
3. Which are some of the positive accomplishments of the Hubble telescope since it was launched?
   AND x_i: 178/1   NEAR x_i: 4/0   AND w_i: 504/1   NEAR w_i: 4/0   PARAGRAPH: 2/1   SENTENCE: 0/0
4. Which are some of the endangered mammals?
   AND x_i: 32,133/0   NEAR x_i: 6,214/1   AND w_i: 32,133/0   NEAR w_i: 6,214/1   PARAGRAPH: 150/80   SENTENCE: 5/2
5. Which are the most crashworthy, and least crashworthy, passenger vehicles?
   AND x_i: 246/0   NEAR x_i: 5/1   AND w_i: 260/1   NEAR w_i: 5/1   PARAGRAPH: 15/6   SENTENCE: 1/0

REAL questions:
1. Where can I find cheap airline fares?
   AND x_i: 1,360/2   NEAR x_i: 3/3   AND w_i: 2,608/2   NEAR w_i: 35/5   PARAGRAPH: 61/34   SENTENCE: 6/3
2. Find out about Fifths disease.
   AND x_i: 2/0   NEAR x_i: 0/0   AND w_i: 30/1   NEAR w_i: 0/0   PARAGRAPH: 10/1   SENTENCE: 0/0
3. What is the price of ICI?
   AND x_i: 4,503/0   NEAR x_i: 202/0   AND w_i: 10,221/0   NEAR w_i: 575/1   PARAGRAPH: 117/10   SENTENCE: 1/0
4. Where can I shop online for Canada?
   AND x_i: 36,049/0   NEAR x_i: 858/1   AND w_i: 36,049/0   NEAR w_i: 858/1   PARAGRAPH: 15/8   SENTENCE: 0/0
5. What are the average wages for event planners?
   AND x_i: 6/1   NEAR x_i: 0/0   AND w_i: 70/0   NEAR w_i: 0/0   PARAGRAPH: 6/6   SENTENCE: 1/0

Each cell in this table includes two numbers: the first represents the total number of documents retrieved for the question (respectively, the total number of paragraphs retrieved when the PARAGRAPH operator was used); the second represents the number of relevant documents found in the top 10 ranking (respectively, the number of relevant paragraphs).

The AND x_i and NEAR x_i columns contain the results of the search when the AND and NEAR operators were applied to the input words x_i. By replacing the words x_i with their similarity lists derived from WordNet, the number of documents retrieved increased, as expected. The results obtained in these cases, with an AND, respectively a NEAR, operator applied to the similarity lists, are presented in the columns AND w_i and NEAR w_i.
The next column contains the number of documents extracted when the new operator PARAGRAPH 2 (meaning two consecutive paragraphs) was applied to words from the similarity lists. The results were encouraging: the number of documents retrieved was small, and correct answers were found in almost all cases.

The last column shows that the SENTENCE operator was too restrictive, producing correct answers in only a few cases.
A summary of the results, for the 100 questions that we used to test our system, is presented in Table 4.3.
Table 4.3. Summary of results for the 100 test questions. In each box, the first number indicates the average number of documents retrieved, and the second number indicates the average number of relevant documents in the top 10 ranking (for the PARAGRAPH column, the average number of paragraphs retrieved and of relevant paragraphs, respectively).

Question                             AND x_i       NEAR x_i     AND w_i       NEAR w_i     PARAGRAPH
Average question from the TREC set   7,746/0.16    258/0.48     25,803/0.44   332/0.88     26.04/11.10
Average question from the REAL set   13,510/0.63   1,843/1.24   28,715/0.61   3,003/1.36   48.95/13.58

4.6.1. Discussion

It is interesting to observe that the query extension determined an increase in the number of documents retrieved, by a factor varying from 0 (meaning an equal number of documents retrieved for both the unextended and the extended queries) to 32.

The number of relevant documents found in the top 10 ranking also increased for the searches performed with an extended query. As can be seen from the summary results presented in the table above, the query extension increased the number of documents retrieved, and also the number of relevant documents in the top 10 ranking.
With the PARAGRAPH operator, the number of documents (paragraphs) retrieved decreased significantly, while the number of relevant ones increased. Table 4.4 presents the precision and the resolution of the searches performed with AltaVista for the two sets of questions. Again, the precision here is defined as the number of relevant documents retrieved in the top 10 hits, while the resolution measures the number of questions for which an answer could be found.
Table 4.4. Precision and resolution for searches performed with AltaVista for the 100 test questions.

Precision
Question                             AND x_i   NEAR x_i   AND w_i   NEAR w_i
Average question from the TREC set   1.6%      4.8%       4.4%      8.8%
Average question from the REAL set   6.3%      12.43%     6.09%     13.65%

Resolution
Average question from the TREC set   36%       44%        20%       36%
Average question from the REAL set   30%       42%        28%       48%

For the 100 tests that we performed using this new operator, we obtained a full-precision (as defined in section 4.1) of 42.6% in the case of the TREC questions, and 27.7% in the case of the REAL questions. The resolution, i.e. the number of answered questions, was 90% for the TREC questions and 66% for the REAL questions; comparing this with the resolution of the searches performed with AltaVista, as shown in Table 4.4, it follows that more questions were answered using our system than using AltaVista. The fact that the resolution of the second set was smaller than that of the first set is explained in the next subsection: there are several cases where no answer could be found, due either to very general or to very specific terms used in formulating the questions.

It is hard to compare our system's performance with that achieved by other implementations; the systems implemented so far that try to retrieve answers to narrow questions either (1) address specific searches, as for example [14], which is designed for finding legal resources on the Internet, or (2) address narrower domains on the Web, as the system described in [9], which uses the files of "Frequently Asked Questions" (FAQs) associated with many Usenet groups.

For the results obtained during the TREC tests, a comparison can be made with the work described in [44], even though that work addresses the task of retrieving information from very large collections of texts rather than from the Internet. In [44], an average full-precision of 36% is reported for full topic statements. Our result of 43% full-precision, in retrieving information for narrow questions over heterogeneous domains on the Internet, is thus encouraging.
A smaller precision, of only 27.7%, was achieved in the case of the REAL questions. The explanation is that users tend to use short questions, which lead to a very large number of documents found and make the task of retrieving relevant information much harder. This also explains the higher resolution obtained for the TREC set with respect to the REAL set.
4.6.2. Limitations

Of the possible types of questions presented in section 4.3, our system can provide an answer to questions from the first class. This is not a strong limitation, as about 80% of the questions usually asked by users are in this class.

The questions in this class can be further subdivided into abstract and concrete questions. Even though the method proposed in this thesis was able to improve the precision in finding correct answers to abstract or concrete queries, there are still particular questions for which no relevant answers could be found.
Very short questions usually lead to a very large number of documents in which it is hard to find relevant information. Consider, as an example, the question "What is the land like in Costa Rica?". The queries formed with boolean operators led to 26,304 documents when the AND operator was used, respectively 1,956 documents when we used the NEAR operator, with no relevant information in the top 10 ranked documents. The query expansion phase brought no modifications to the structure of the query, as both "land" and "Costa Rica" have no synonyms. Using the PARAGRAPH operator, 713 paragraphs were retrieved, too many to be useful in the process of finding a relevant answer.
At the other extreme, questions with very specialized terms lead to no results. For the question "Where can I find a cartoon depicting the Sugar Act of 1764?", we obtained 0 documents using an AND-query, and 0 documents with a NEAR-query. The query expansion phase increased the former number to 15 (still with no relevant information in these documents), and the PARAGRAPH operator could not find any relevant information.
4.6.3. Extensions

A more flexible NEAR search could be implemented with a new operator SEQUENCE (W1 d W2 d ... d Wn), where d is a numeric variable that indicates the maximum distance between the words in the lists for which the search is done. The SEQUENCE operator requires that the order of the similarity lists be maintained as specified; a sketch is given below.
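The following is a minimal sketch of such a SEQUENCE test over a tokenized document; the recursive matching strategy, the sequence_match name and the default distance are our own assumptions, and, for simplicity, multi-word terms from the similarity lists are not handled.

    def sequence_match(tokens, similarity_lists, d=3):
        """SEQUENCE operator sketch: true if some word from each similarity
        list occurs in the given list order, with at most d tokens between
        consecutive matches."""
        low = [t.lower() for t in tokens]

        def rest(pos, lists):
            if not lists:
                return True
            wanted = {w.lower() for w in lists[0]}
            for j in range(pos, min(pos + d + 1, len(low))):
                if low[j] in wanted and rest(j + 1, lists[1:]):
                    return True
            return False

        first = {w.lower() for w in similarity_lists[0]}
        return any(low[i] in first and rest(i + 1, similarity_lists[1:])
                   for i in range(len(low)))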
There are several other possible ways of improving web search that are not discussed in this thesis. One such possibility is to index words by their WordNet senses. This of course implies some on-line word sense disambiguation of documents, which may become possible in the not too distant future. Semantic indexing has the potential of improving the ranking of search results, as well as allowing the extraction of objects and their relationships [36]. Another way to improve web search is to use compound nouns or collocations. In WordNet there are thousands of groups of words, such as mother in law, stock market, etc., that point to their respective concepts. Each compound noun is better indexed as one term. This reduces the storage space for the search engine and may increase the precision.
CHAPTER 5
CONCLUSIONS AND FURTHER WORK

WordNet is a fine-grained MRD, and this makes it more difficult to pinpoint the correct sense combination, since there are many to choose from and many are semantically close. For applications such as machine translation, fine-grained disambiguation works well, but for information extraction and some other applications this is overkill, and some senses may be lumped together. The ranking of senses is useful for many applications.
In this thesis, we presented a method which disambiguates all the nouns, verbs, adjectives and adverbs in a text. It is based on the idea of word-word dependencies found within a context. Our WSD method combines statistical results, as gathered from the Internet, with semantic density measures calculated on the WordNet hierarchies.

An important aspect of this method is that it provides a sense ranking over the possible senses that words might have in a given context. This information proves useful in tasks such as information retrieval, where knowing the first possible correct senses of the words in the input question/sentence can highly improve the quality of the information retrieved.
The main drawback of this method is the time needed to perform the disambiguation of the words. Both algorithms are time consuming:

1. The first algorithm gathers statistics from the Internet. Even though it provides good results in ranking word senses, this algorithm has the inherent problems related to the use of the Internet, such as network congestion, sites located far away, or sites which are unreachable.
2. The second algorithm calculates the semantic distance between words, using WordNet. As described in section 3.3, this makes use of the semantic hierarchies of the different senses of the words; the time needed to compute this metric varies directly with the size of these hierarchies, and for large hierarchies the time will be on the order of minutes.

Table 5.1 presents the average time needed to execute these algorithms, as determined for 20 verb-noun pairs.
Table 5.1. Time measurements for the disambiguation method

Algorithm                                      Time (sec.)
Algorithm 1 (statistics from the Internet)     180
Algorithm 2 (semantic density)                 50
Total                                          230
The average time for disambiguating a pair of words is 230 seconds. This does not constitute a big problem when the disambiguation has to be done for small texts, such as questions or sentences, but it becomes overwhelming when large texts have to be semantically tagged. In this latter case, a different approach should be considered.

To have a robust WSD system which could work for large texts as well as for small sentences, further work is needed.
1. Improve the POS tagger with concept identification. As it is now, Brill's POS tagger recognizes only single words; for example, "child care", which is identified by WordNet as a single noun concept, is tagged by Brill's tagger as two separate nouns, "child" and "care". A possible solution to this problem is to combine the lexicon provided by WordNet with the POS tagger developed by Eric Brill, so as to obtain an improved tagger which recognizes these lexical expressions (a sketch of such a merging step is given after this list).
2. Identify the syntactic relations in the input text. The basic relations which have to be identified are (verb, object), (verb, subject), (adjective, noun) and (adverb, verb). For this purpose, we plan to use either the syntactic parser developed at the University of Pennsylvania, or the one from Xerox, depending on the accuracy obtained during tests.

3. An error rate is introduced into our semantic density method by the fact that the glosses are neither POS tagged nor syntactically parsed. We believe that a significant improvement can be achieved if we build the context of a particular word by taking the syntactic relations into consideration. At this point, we have to make a choice.

(a) We can try to replace the examples provided by the glosses with examples found in SemCor. The files from SemCor are already POS tagged, and we will also have to identify the syntactic relations in these texts. The original algorithm will be modified so as to compute the semantic density among words by taking into consideration only pairs of words in the same syntactic relation.

(b) We can continue to use the glosses, but we should modify them so that POS and syntactic information is included.
Based on the tests which we are going to perform, regarding the accuracy achieved and the time performance of each of these methods, we will decide which approach can lead to a more robust WSD system.
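Returning to item 1 above, a minimal sketch of a lexicon-driven collocation-merging step, assuming NLTK's WordNet interface, is shown below; the greedy longest-match strategy and the merge_collocations name are illustrative choices of our own, not the method actually implemented.

    from nltk.corpus import wordnet as wn  # assumes the NLTK WordNet corpus is installed

    def merge_collocations(tokens):
        """Greedily merge adjacent tokens into multi-word WordNet concepts,
        e.g. ['stock', 'market'] -> ['stock_market'], before POS tagging."""
        out, i = [], 0
        while i < len(tokens):
            for span in (4, 3, 2):                  # try the longest match first
                if i + span <= len(tokens):
                    candidate = '_'.join(tokens[i:i + span]).lower()
                    if wn.synsets(candidate):       # known WordNet collocation
                        out.append(candidate)
                        i += span
                        break
            else:                                   # no merge: keep the single token
                out.append(tokens[i])
                i += 1
        return out

    print(merge_collocations(['the', 'stock', 'market', 'crashed']))
    # ['the', 'stock_market', 'crashed']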
As an application of the WSD method, we presented a system for searching the Internet, improved with NLP techniques. This system, described in Chapter 4, performs two main tasks: (1) it expands the input question, extending the search with words semantically related to the keywords from the original question; (2) it applies new lexical operators to fetch the relevant information from the documents found on the Internet. The results obtained are promising: the precisions of 42.6% in the case of the TREC questions and 27.7% in the case of the REAL questions are higher than, or comparable with, previous results in this field.
Of the types of questions presented in section 4.3, our system is limited to the first class, which basically involves an identification of information. This is not a hard limitation though, as studies performed in this field have shown that 80% of the questions usually asked by people fall into this class. Further work will address the second class of questions; for this, the system will have to identify several facts related to the subject of the question, and eventually summarize these facts in order to give an answer.
The broad use of natural language queries in information retrieval is still beyond the capabilities of current natural language technology. Machine readable dictionaries, such as WordNet, prove to be useful tools for web search. However, their use for searching the Internet has been limited so far [2], [19], [21].
REFERENCES

[1] Agirre, E., and Rigau, G. A proposal for Word Sense Disambiguation using conceptual distance. In Proceedings of the 1st International Conference on Recent Advances in Natural Language Processing (Velingrad, 1995).

[2] Allen, B. WordWeb - using the lexicon for WWW. http://www.inference.com, 1997. Inference Corporation.

[3] AltaVista, 1999. Digital Equipment Corporation, http://www.altavista.com.

[4] Anikina, N., Golender, V., Kozhukhina, S., Vainer, L., and Zagatsky, B. REASON: NLP-based search system for WWW. In Proceedings of the American Association for Artificial Intelligence Conference, Spring Symposium, "NLP for WWW" (Stanford University, CA, 1997), pp. 1-10.

[5] Bloom, B., Engelhart, M., Furst, E., Hill, W., and Krathwohl, D. Taxonomy of Educational Objectives, Handbook 1: Cognitive Domain. David McKay Company Inc., 1956.

[6] Brill, E. A simple rule-based part of speech tagger. In Proceedings of the 3rd Conference on Applied Natural Language Processing (Trento, Italy, 1992).

[7] Bruce, R., and Wiebe, J. Word Sense Disambiguation using decomposable models. In Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics (ACL-94) (Las Cruces, NM, June 1994), pp. 139-146.

[8] Buckley, C., Salton, G., Allan, J., and Singhal, A. Automatic Query Expansion Using SMART: TREC 3. NIST, 1994, pp. 69-81.

[9] Burke, R., Hammond, K., and Kozlovsky, J. Knowledge-based information retrieval from semi-structured text. In Proceedings of the American Association for Artificial Intelligence Conference, Fall Symposium, "AI Applications in Knowledge Navigation & Retrieval" (Cambridge, MA, 1995).

[10] Callan, J. Passage-level evidence in document retrieval. In Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (Dublin, Ireland, 1994), pp. 302-310.

[11] Callan, J., Croft, W., and Harding, S. The INQUERY retrieval system. In Proceedings of the 3rd International Conference on Database and Expert Systems Applications (1992), pp. 78-83.

[12] Cowie, J., Guthrie, L., and Guthrie, J. Lexical disambiguation using simulated annealing. In Proceedings of the 5th International Conference on Computational Linguistics COLING-92 (1992), pp. 157-161.

[13] Fellbaum, C. WordNet, An Electronic Lexical Database. The MIT Press, 1998.

[14] FindLaw, internet legal resources. http://www.findlaw.com/index.html, 1997.

[15] Gale, W., Church, K., and Yarowsky, D. One sense per discourse. In Proceedings of the DARPA Speech and Natural Language Workshop (Harriman, New York, 1992).

[16] Gravano, L., Chang, K., Garcia-Molina, H., Lagoze, C., and Paepcke, A. STARTS, Stanford protocol proposal for Internet retrieval and search. Digital Library Project, Stanford University, 1997.

[17] Harabagiu, S., and Moldovan, D. Enriching the WordNet Taxonomy with Contextual Knowledge Acquired from Text. AAAI/MIT Press, 1999.

[18] Hearst, M. Multi-paragraph segmentation of expository text. In Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics (Las Cruces, NM, 1994), pp. 9-16.

[19] Hearst, M., Karger, D., and Pedersen, J. Scatter/gather as a tool for the navigation of retrieval results. In Proceedings of the American Association for Artificial Intelligence Conference, Fall Symposium "AI Applications in Knowledge Navigation & Retrieval" (Cambridge, MA, 1995), pp. 65-71.

[20] Ishikawa, K., Satoh, K., and Okumura, A. Query Term Expansion based on Paragraphs of the Relevant Documents. NIST, 1997, pp. 577-585.

[21] Katz, B. From sentence processing to information access on the World Wide Web. In Proceedings of the American Association for Artificial Intelligence Conference, Spring Symposium, "NLP for WWW" (Stanford, CA, 1997), pp. 77-86.

[22] Leong, M. Concrete Queries in Specialized Domains: Known Item as Feedback for Query Formulation. NIST, 1997, pp. 541-550.

[23] Li, X., Szpakowicz, S., and Matwin, M. A WordNet-based algorithm for word semantic sense disambiguation. In Proceedings of the 14th International Joint Conference on Artificial Intelligence IJCAI-95 (Montreal, Canada, 1995).

[24] Lu, X., and Keefer, R. Query Expansion/Reduction and its Impact on Retrieval Effectiveness. NIST, 1994, pp. 231-240.

[25] Marcus, M., Santorini, B., and Marcinkiewicz, M. Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics 19, 2 (1993), 313-330.

[26] McRoy, S. Using multiple knowledge sources for Word Sense Disambiguation. Computational Linguistics 18, 1 (1992), 1-30.

[27] Mihalcea, R., and Moldovan, D. Word Sense Disambiguation based on semantic density. In Proceedings of the COLING-ACL '98 Workshop on Usage of WordNet in Natural Language Processing Systems (Montreal, Canada, 1998).

[28] Mihalcea, R., and Moldovan, D. An automatic method for generating sense tagged corpora. In Proceedings of AAAI-99 (Orlando, FL, July 1999). (to appear).

[29] Mihalcea, R., and Moldovan, D. A method for Word Sense Disambiguation of unrestricted text. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics (ACL-99) (Maryland, June 1999). (to appear).

[30] Miller, G. WordNet: A lexical database. Communications of the ACM 38, 11 (1995), 39-41.

[31] Miller, G., Chodorow, M., Landes, S., Leacock, C., and Thomas, R. Using a semantic concordance for sense identification. In Proceedings of the 4th ARPA Human Language Technology Workshop (1994), pp. 240-243.

[32] Miller, G., Leacock, C., Randee, T., and Bunker, R. A semantic concordance. In Proceedings of the 3rd DARPA Workshop on Human Language Technology (Plainsboro, New Jersey, 1993), pp. 303-308.

[33] Moldovan, D., and Mihalcea, R. A WordNet-based interface to Internet search engines. In Proceedings of FLAIRS-98 (Sanibel Island, FL, May 1998).

[34] Moldovan, D., et al. USC: Description of the SNAP system used for MUC-5. In Proceedings of the 5th Message Understanding Conference (Baltimore, MD, 1993).

[35] Ng, H., and Lee, H. Integrating multiple knowledge sources to disambiguate word sense: An exemplar-based approach. In Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics (ACL-96) (Santa Cruz, 1996).

[36] Pustejovsky, J., Boguraev, B., Verhagen, M., Buitelaar, P., and Johnston, M. Semantic indexing and typed hyperlinking. In Proceedings of the American Association for Artificial Intelligence Conference, Spring Symposium, "NLP for WWW" (Stanford, CA, 1997), pp. 120-128.

[37] Resnik, P. Selectional preference and sense disambiguation. In Proceedings of the ACL Siglex Workshop on Tagging Text with Lexical Semantics: Why, What and How? (Washington DC, April 1997).

[38] Resnik, P., and Yarowsky, D. A perspective on Word Sense Disambiguation methods and their evaluation. In Proceedings of the ACL Siglex Workshop on Tagging Text with Lexical Semantics: Why, What and How? (Washington DC, April 1997).

[39] Rigau, G., Atserias, J., and Agirre, E. Combining unsupervised lexical knowledge methods for Word Sense Disambiguation. Computational Linguistics (1997).

[40] Salton, G., and Lesk, M. Computer evaluation of indexing and text processing. Prentice Hall, Inc., Englewood Cliffs, New Jersey, 1971, pp. 143-180.

[41] Selberg, E., and Etzioni, O. Multi-service search and comparison using the MetaCrawler. In Proceedings of the 4th International World Wide Web Conference (Boston, MA, 1995), pp. 195-208.

[42] Stetina, J., Kurohashi, S., and Nagao, M. General Word Sense Disambiguation method based on a full sentential context. In Usage of WordNet in Natural Language Processing, Proceedings of the COLING-ACL Workshop (Montreal, Canada, July 1998).

[43] Text REtrieval Conference (TREC). http://trec.nist.gov.

[44] Voorhees, E. Query expansion using lexical-semantic relations. In Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (Dublin, Ireland, 1994), pp. 61-69.

[45] Yarowsky, D. Unsupervised Word Sense Disambiguation rivaling supervised methods. In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics (ACL-95) (Cambridge, MA, 1995), pp. 189-196.