
Weekly Progress Report

Kevin Glass
2005-08-23
1 Full Name
Kevin Robert Glass
2 Title of Project
Natural Language Information Extraction for Constructing Virtual Environments
3 Supervisor
Professor Shaun Bangay
4 Date
2005-08-23
5 Previous Short Term Objectives
The short term objectives set for the past week were as follows:
- Complete a version of the manual annotation software, and write up the design of the software.
- Run one anaphora resolution test based on verb-noun theory.
- Write up at least text-tokenisation as part of the thesis.
- Correct the Poster description page.
- Complete and edit the coursework papers, in order to remove them from the list of distractions!
6 Progress
6.1 Manual Annotation Software
6.1.1 Progress
The manual annotation software is in the process of being implemented. The following steps are
working:
- Ibizan is used as a framework for the annotation process. A new application has been built with 3 viewports, which display the object from the front, the left and the top. Each viewport can be rotated to see the object more clearly.
- Standard rotations have been implemented in the form of buttons alongside the viewport. Each one rotates the object ninety degrees around a different axis. Instructions on the screen indicate that the object should be rotated to face the viewer in the Front viewport. These rotations have been implemented in the Mesh class, since we wish to standardise the actual geometry. Note that forward is facing the positive z-axis.
- Object normalisation has been implemented by determining the principal axis of the object, and scaling the object by 1 over the length of this axis. This keeps the proportions of the object while reducing it to fit within the unit cube.
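The normalisation step above can be sketched as follows. This is a minimal illustration rather than the actual Mesh class code; the vertex representation and function name are hypothetical.

```python
def normalise(vertices):
    """Scale a mesh uniformly so its principal (longest) axis has length 1.

    vertices: list of (x, y, z) tuples. A single uniform scale factor is
    used, so the object's proportions are preserved while it is reduced
    to fit within the unit cube.
    """
    # Extents of the axis-aligned bounding box.
    mins = [min(v[i] for v in vertices) for i in range(3)]
    maxs = [max(v[i] for v in vertices) for i in range(3)]
    # The principal axis is the longest bounding-box extent.
    principal = max(maxs[i] - mins[i] for i in range(3))
    scale = 1.0 / principal
    return [tuple(c * scale for c in v) for v in vertices]
```

After normalisation the longest extent is exactly 1 and the other two are proportionally smaller.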
- A function to enter a descriptive keyword has been included, which automatically fetches synonyms, hypernyms and hyponyms from WordNet. The senses of a specific word are searched, identifying only senses which are descendants of physical_object. These senses are listed. A facility to choose a sense and list the corresponding hypernyms and hyponyms is implemented. One can select one of these words as an alternative to the original keyword.
- The functionality to save the standardised geometry has been implemented, with the output being an OFF format object file.
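The sense-filtering idea can be illustrated with a toy hypernym graph. The graph contents, sense names and function names below are hypothetical stand-ins for the real WordNet data and the JWNL calls, kept self-contained for clarity.

```python
# Toy hypernym graph: each sense maps to its hypernym (parent) sense.
# Real WordNet would be queried through an API such as JWNL instead.
HYPERNYM = {
    "dog.n.01": "canine.n.01",
    "canine.n.01": "animal.n.01",
    "animal.n.01": "physical_object.n.01",
    "dog.n.02": "unpleasant_person.n.01",   # the insult sense
    "unpleasant_person.n.01": "abstraction.n.01",
}

def is_descendant_of(sense, ancestor):
    """Follow the hypernym chain upward, checking for the ancestor."""
    while sense in HYPERNYM:
        sense = HYPERNYM[sense]
        if sense == ancestor:
            return True
    return False

def physical_senses(senses):
    """Keep only the senses rooted under physical_object."""
    return [s for s in senses if is_descendant_of(s, "physical_object.n.01")]
```

With this filter, only the physical sense of "dog" survives, while the abstract (insult) sense is discarded, which mirrors the behaviour described above.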
6.1.2 Problems
A substantial number of problems slowed the progress of this section, namely:
- The acquisition and inclusion of a WordNet interface. Three options were available here: find a suitable Java interface, find a suitable C++ interface which would then interface with Java at a higher level, or access the WordNet data files directly. The last option was dismissed since too much time would be wasted reinventing something that has already been implemented. Two Java APIs were found, JWordNet and JWNL. JWordNet, while simple to install, proved to be non-intuitive, and still required a vast amount of knowledge regarding the structure of WordNet. Therefore JWNL is used; however, it is still a highly complex API. A number of problems were experienced during its use:
  - Including the source files in compilation to native code with the rest of Ibizan. Fortunately JWNL is shipped with the full source, and hence could be compiled to a shared library. A number of issues needed to be resolved, such as including the Apache commons-logging module, but the API has now been successfully included.
  - Lack of knowledge regarding the API caused a large amount of time to be consumed in understanding how the API worked, becoming familiar with the classes, and learning how its structures relate to WordNet. However, a functional knowledge has been achieved, and suitable outputs, including the extraction of hypernym and hyponym sets, can be obtained.
  - One small error in the API's configuration caused a large delay: it was configured to use the latest version of WordNet (2.1), while the API is programmed for WordNet 2.0. This caused some inexplicable NullPointerExceptions and unreliability which was difficult to track down. However, the implementation seems to be running reliably at present.
- Lack of knowledge regarding the structure of Ibizan caused progress to be slow; however, this is expected to improve.
Thus the following still needs to be implemented with regard to the manual annotation:
- The output of a corresponding XML document containing the annotation data.
- The creation of an indexing system which will aid in the extraction of objects.
- The creation of a categorisation scheme, and its storage representation.
6.2 Script Extraction Progress
The test to be conducted involving verb-noun groupings has not yet been completed. As mentioned previously, the process was slowed by the learning and inclusion of a WordNet API. Preliminary results from the test are mixed: a large number of unknown candidates are being returned, although in many cases the correct pronoun is being extracted. It is unclear at this stage whether this is the result of an error in the code. An experiment is therefore proposed which aims to find the common structure of sentences surrounding quoted speech, to see if any heuristics can be identified.
6.3 Thesis
The following is the result of the writeup on text-tokenisation that has been completed thus far.
6.3.1 Text Preprocessing and Tokenisation
Introduction Written language consists of a collection of symbols, the grouping of which forms the basis of meaning and understanding. At the lowest level is the grouping of characters to form representations of words, which are in turn grouped to form sentences. Depending on the application, further groupings into paragraphs or chapters may also take place. The set of characters includes not only letters, but also numbers, punctuation, and non-alphanumeric symbols (for example, the dollar sign, $). Generally, words are constructed from the set of letters. Additional meaning, including the indication of sentence boundaries, quotations, emphasis and expression, is conveyed by appending punctuation to groupings of words.
For computational processing, it is desired that symbols in written text be separated into
logical units that indicate some base form of meaning. In the simplest case, for example, a stream
of letters is separated into words by extracting letters that fall between instances of white-space.
This process of logical grouping is named tokenisation.
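In its simplest form this grouping can be expressed in a few lines. The sketch below goes one small step beyond whitespace splitting by also separating punctuation into individual tokens; the regular expression is an assumption for illustration, not the project's actual tokenisation rules.

```python
import re

def tokenise(text):
    """Split a character stream into word and punctuation tokens.

    Runs of letters/digits (allowing an internal apostrophe, as in
    "man's") form word tokens; every other non-space character becomes
    a token of its own.
    """
    return re.findall(r"[A-Za-z0-9]+(?:'[A-Za-z]+)?|[^\sA-Za-z0-9]", text)
```

For example, `tokenise("No one would have believed.")` yields the five words followed by the full stop as a separate token.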
The tokenisation of text in the context of scene description has a number of requirements. Firstly, text representations which are noisy (that is, filled with unnecessary characters) must be filtered. Additionally, the tokens extracted from the text must be easily matched against sources which can provide semantic meaning, such as dictionaries. Finally, the logical structure of the text, for example sentence boundaries, must be accurately determined from the extracted tokens.
This section discusses the process of tokenisation implemented for the Text-to-Scene conversion system. In particular, the need for text cleanup, the role of tokenisation in the overall conversion process, and the problems and corresponding solutions encountered during the construction of this module are examined.
Context The process of text-to-scene conversion requires that the input text be syntactically tagged, as well as parsed. This is done using a number of parts-of-speech tagging and syntactic parsing tools. Each of these tools requires that the input text be formatted in a particular way. For example, many parts-of-speech tagging tools require that the input contains a single token per line, with punctuation marks considered as individual tokens. Other tools require that the input text contain one sentence per line, which requires the correct identification of sentence boundaries.
The fundamental requirement of the tokenisation process is to preprocess raw text from the fictional sources, producing a standard representation which:
- Can be converted into the required input format for the various tools.
- Can be used to correlate the information provided as output from the various tools.
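As an illustration of the first requirement, converting tokenised sentences into the one-token-per-line format expected by many taggers is a simple formatting step. The function name and the blank-line sentence separator are assumptions for this sketch.

```python
def to_tagger_input(sentences):
    """Format tokenised sentences for a typical POS tagger.

    sentences: list of token lists, one per sentence. Emits one token
    per line, with a blank line separating sentences.
    """
    blocks = ["\n".join(tokens) for tokens in sentences]
    return "\n\n".join(blocks) + "\n"
```

A one-sentence-per-line format for other tools would be a similar, equally small formatter over the same token representation, which is the point of keeping a single standard tokenised form.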
Figure 1 describes the significance of text-tokenisation as a preprocessor to the text-to-scene conversion system. The raw text is fed as input to a text-cleanup module, which removes formatting inconsistencies in the digital text. The output of this module is used as input to the tokenisation process, which produces a tokenised version of the raw text. The tokenised version is easily formatted into the required input for the various tools. The output of the various tools is then correlated against the original token file in order to produce a unified representation of the text and the data collected thus far.
Text-cleanup Raw text in digital format sourced from corpora such as Gutenberg is generated either by manually typing in texts from the original sources or by scanning them using Optical Character Recognition. The varying manner in which the text is digitised leads to inconsistencies in sentence and paragraph boundaries. Figure 2 presents an example of text obtained from Gutenberg. In this instance, ^M and $ indicate carriage-return and line feed. Note how these occur in a manner indicative of formatting rather than logical structure. It is therefore necessary to remove such symbols from the text without losing sentence and paragraph boundaries.
The following solution has been implemented, which removes formatting marks such as carriage-returns and line feeds in the relevant positions.
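A sketch of this cleanup step, assuming the convention visible in Figure 2: hard-wrapped lines end in carriage-return/line-feed pairs, and blank lines mark paragraph boundaries. The exact rules used in the project may differ; this is only a minimal model of the idea.

```python
def clean(raw):
    """Unwrap hard line breaks while preserving paragraph boundaries.

    Lines within a paragraph are joined with single spaces; blank lines
    (pure formatting in the source) become paragraph separators in the
    output, so the logical structure survives the cleanup.
    """
    # Normalise carriage-return/line-feed pairs to plain newlines.
    text = raw.replace("\r\n", "\n").replace("\r", "\n")
    paragraphs, current = [], []
    for line in text.split("\n"):
        if line.strip():
            current.append(line.strip())
        elif current:
            paragraphs.append(" ".join(current))
            current = []
    if current:
        paragraphs.append(" ".join(current))
    return "\n\n".join(paragraphs)
```

Applied to the Figure 2 excerpt, the wrapped sentence lines are rejoined into a single paragraph while the chapter heading remains on its own.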
[Figure 1 is a flow diagram: raw fictional text passes through the text-cleanup and text-tokenisation modules to produce the tokenised text; formatters convert the tokenised text into input for the parts-of-speech tagging and syntactic parsing tools, and a correlation step combines their output with the tokens into the unified representation.]
Figure 1: The role of the tokenised document in the preprocessing phase
CHAPTER ONE^M$
^M$
THE EVE OF THE WAR^M$
^M$
^M$
No one would have believed in the last years of the nineteenth^M$
century that this world was being watched keenly and closely by^M$
intelligences greater than mans and yet as mortal as his own; that as^M$
men busied themselves about their various concerns they were^M$
scrutinised and studied, perhaps almost as narrowly as a man with a^M$
Figure 2: Example of text sourced from Gutenberg, War of the Worlds by H.G. Wells
6.4 SAICSIT Poster
The one-page writeup for the SAICSIT Poster has been completed, and is attached to the end of this document.
6.5 Coursework
The coursework articles have all been edited, and standardised to a common SIGGRAPH format.
They have also been combined to form a single document, which can be downloaded from the
web-page.
7 Short Term Objectives
The short term objectives for the coming week are as follows:
- Complete the manual annotation software, and write up the design of the software.
  - Create the XML output file.
  - Design a categorisation scheme.
- Improve the script extraction based on verb-noun theory.
  - Perform an analysis of surrounding sentences.
- Run the WordNet test to determine the requirements of an object library for fictional text.
- Run a number of tests on WordNet to find out information such as:
  - Average number of senses per word.
  - Number of words that can be traced back to physical entities/speech/abstract entities.
- Complete the text-tokenisation section of the thesis.
  - Include some data regarding the cleanup process, such as the number of lines in the original vs. the cleaned text.
  - Think up some tests which indicate the validity of the implemented tokenisation scheme.