Machine Translation 15: 149-185, 2000.
© 2001 Kluwer Academic Publishers. Printed in the Netherlands.

Nine Issues in Speech Translation

MARK SELIGMAN
GETA, Université Joseph Fourier, 385 rue de la Bibliothèque, 38041 Grenoble Cedex 9, France
(E-mail: markseligman@earthlink.net)

Abstract. This paper sketches research in nine areas related to spoken language translation: interactive disambiguation (two demonstrations of highly interactive, broad-coverage speech translation are reported); system architecture; data structures; the interface between speech recognition and analysis; the use of natural pauses for segmenting utterances; example-based machine translation; dialogue acts; the tracking of lexical co-occurrences; and the resolution of translation mismatches.

Key words: dialogue acts, example-based machine translation, interactive disambiguation, pauses, speech recognition, spoken language translation

1. Introduction
This paper reviews some aspects of the author's research in spoken language translation (SLT) since 1992. Since the purpose is to prompt discussion, the treatment is informal, programmatic, and speculative. There is frequent reference to work in progress - in other words, work for which evaluation is incomplete.
The paper sketches work in nine areas: interactive disambiguation; system architecture; data structures; the interface between speech recognition (SR) and analysis; the use of natural pauses for segmenting utterances; example-based machine translation; dialogue acts; the tracking of lexical co-occurrences; and the resolution of translation mismatches. There is no attempt to provide a balanced survey of the SLT scene. Instead, the hope is to provide a provocative and somewhat personal look at the field by spotlighting it from nine directions - in some respects, to offer an editorial rather than purely a report.
One of the most significant and difficult aspects of the SLT problem is the need to integrate effectively many different sorts of knowledge: phonological, prosodic, morphological, syntactic, semantic, discourse, and domain knowledge should ideally work together to produce the most accurate and helpful translation. Thus a trend toward greater integration of knowledge sources is visible in current SLT research (e.g., in the Verbmobil project, Wahlster, 1993), and most of the work described below is in this integrative direction. Many of the issues to be discussed here could in fact be addressed by dedicated pieces of software playing parts in an integrated SLT system. The paper's conclusion will review the issues by sketching an idealized system of this sort - a kind of personal dream team in which the components are team members.

However, the first topic to be discussed is a renegade, headed in exactly the opposite direction. This is because, while continuing my concern with integration of SLT system components, I have become interested in an alternative system design which, in sharp contrast, stresses a clean separation between SR and translation. The thrust of this alternative "low road" or "quick and dirty" approach is to substitute temporarily intensive user interaction for system integration, thereby attempting a radical design simplification in hopes of fielding practical, broad-coverage systems as soon as possible.
To accommodate this renegade on the one hand and the team players on the other, the paper will be not only somewhat personal, but also two-faced. I will begin by advocating a "low road", non-integrated approach for the near term throughout Section 2. Two demonstrations of highly interactive, broad-coverage SLT will be reported and discussed. Then, performing an about-face, I will go on in the remaining sections to consider elements of a more satisfying integrated approach for the longer term.

2. Interactive Disambiguation
In the present state of the art, several stages of SLT leave ambiguities which current techniques cannot yet resolve correctly and automatically. Such residual ambiguity plagues SR, analysis, transfer, and generation alike.
Since users can generally resolve these ambiguities, it seems reasonable to incorporate facilities for interactive disambiguation into SLT systems, especially those aiming for broad coverage. A good idea of the range of work in this area can be gained from Boitet (1996a).
In fact, Seligman (1997) suggests that, by stressing such interactive disambiguation - for instance, by using highly interactive commercial dictation systems for input, and by adapting existing techniques for interactive disambiguation of text translation (Boitet, 1996b; Blanchon, 1996) - practically usable SLT systems may be constructable in the near term. In such "quick and dirty" or "low road" SLT systems, user interaction is substituted for system integration. For example, the interface between SR and analysis can be supplied entirely by the user, who can correct SR results before passing them to translation components, thus bypassing any attempt at effective communication or feedback between SR and MT.
The argument, however, is not that the "high road" toward integrated and maximally automatic systems should be abandoned. Rather, it is that the low road of forgoing integration and embracing interaction may offer the quickest route to widespread usability, and that experience with real use is vital for progress. Clearly, the high road is the most desirable for the longer term: integration of knowledge sources is a fundamental issue for both cognitive and computer science, and maximally automatic use is intrinsically desirable. The suggestion, then, is that the low and high roads be traveled in tandem; and that even systems aiming for full automaticity recognize the need for interactive resolution when automatic resolution is insufficient. As progress is made along the high road and increasing knowledge can be applied to automatic ambiguity resolution, interactive resolution should be necessary less often. When it is necessary, its quality should be improved: questions put to the user should become more sensible and more tightly focused.

2.1. TWO INTERACTIVE DEMOS

These design concepts have been informally and partly tested in two demos, first at the Machine Translation Summit in San Diego in October, 1997, and a second time at the meeting of C-Star II (Consortium for Speech Translation Advanced Research) in Grenoble, France, in January, 1998. Both demos were organized and conducted under the supervision of Mary Flanagan, and both demo systems were based upon a text-based chat translation system previously built by Flanagan's team at CompuServe, Inc. The company's proprietary on-line chat technology was used, as distinct from Internet Relay Chat, or IRC (Pyra, 1995).[1]
In an on-line chat session, users most often converse as a group, though one-on-one conversations are also easy to arrange. Each conversant has a small window used for typing input. Once the input text is finished, the user sends it to the chat server by pressing Return. The text comes back to the sender after an imperceptible interval, and appears in a larger window, prefaced by a header indicating the author. Since this larger window receives input from all parties to the chat conversation, it soon comes to resemble the transcript of a cocktail party, often with several conversations interleaved.
Each party normally sees the "same" transcript window. However, prior to the SLT demos, CompuServe had arranged to place at the chat server a commercial translation system of the direct variety, enabling several translation directions. Once the user of this experimental chat system had selected a direction (say English-French), all lines in the transcript window would appear in the source language (in this case, English), even if some of the contributions originated in the target language (here, French). Bilingual text conversations were thus enabled between English typists and writers of French, German, Spanish, or Italian.
At the time of the demos, total delay from the pressing of Return until the arrival of translated text in the interlocutor's transcript window averaged about six seconds, well within tolerable limits for conversation.[2]
At the author's suggestion and with his consultation, highly interactive SLT demos were created by adding SR front ends and speech-synthesis back ends to CompuServe's text-based chat-translation system. Two laptops were used, one running English input and output software (in addition to the CompuServe client, modified as explained below), and one running the comparable French programs.
Commercial dictation software was employed for SR. For the first demo, both sides used discrete dictation, in which short pauses are required between words; for the second demo, English was dictated continuously - that is, without required pauses - while French continued to be dictated discretely.[3]

At the time of the demos, the discrete products allowed dictation directly into the chat input buffer, but the continuous products required dictation into their own dedicated window. Thus for continuous English input it became necessary to employ third-party software[4] to create a macro which (a) transferred dictated text to the chat input buffer and (b) inserted a Return as a signal to send the chat.[5]
Commercial speech synthesis programs packaged with the discrete dictation products were used for voice synthesis. Using development software sold separately by the dictation vendor, CompuServe's chat client software was customized so that, as each text string returning from the chat server was written to the transcript window, it was simultaneously sent to the speech synthesis engine to be pronounced in the appropriate language. The text read aloud in this way was either the user's own, transmitted without changes, or the translation of an interlocutor's input.
The first demo took place in an auditorium before a quiet audience of perhaps 100, while the second was presented to numerous small groups in a booth in a noisy room of medium size. Each demo began with ten scripted and pre-tested utterances, and then continued with improvised utterances, sometimes solicited from the audience - perhaps six in the first demo, and 50 or more in the second. Some examples of improvised sentences are given in (1)-(2).

(1) French: Qu'est-ce que vous étudiez? (What do you study?)
    English: Computer science. (L'informatique.)

(2) French: Qu'est-ce que vous faites plus tard? (What are you doing later?)
    English: I'm going skiing. (Je vais faire du ski.)
    French: Vous n'avez pas besoin de travailler? (You don't need to work?)
    English: I'll take my computer with me. (Je prendrai mon ordinateur avec moi.)
    French: Où est-ce que vous mettrez l'ordinateur pendant que vous skiez? (Where will you put the computer while you ski?)
    English: In my pocket. (Dans ma poche.)
As these examples suggest, the level of language remained basic, and sentences were purposely kept short, with standard grammar and punctuation.

2.2. DISCUSSION OF DEMOS

A primary purpose of the chat SLT demos was to show that SLT is both feasible and suitable for on-line chat users, at least at the proof-of-concept level.
In my own view, the demos were successful in this respect. The basic feasibility of the approach appears in the fact that most demo utterances were translated comprehensibly and within tolerable time limits. It is true that the language, while mostly spontaneous, was consciously kept quite basic and standard. It is also true that there were occasional translation errors (discussed below). Nevertheless, the demos can plausibly be claimed to show that chatters making a reasonable effort could successfully socialize in this way. As preliminary evidence that many users could adjust to the system's limitations, we can remark that the dozen or so utterances suggested by the audience, once repeated verbatim by the demonstrators, were successfully recognized, translated, and pronounced in every case.
In addition to the general demo goals just mentioned, the author also had his own, more specific axes to grind from the viewpoint of SLT research. I hoped the demos would be the first to show broad-coverage SLT of usable quality; and I hoped they would highlight the potential usefulness of interactive disambiguation in moving toward practical broad-coverage systems.[6]
I believe that these goals, too, were reached. Coverage was indeed broad by contemporary standards. There was no restriction on conversational topic - no need, for instance, to remain within the area of airline reservations, appointment scheduling, or street directions. As long as the speakers stayed within the dictation and translation lexica (each in the tens of thousands of words), they were free to chat and banter as they liked.
The usefulness of interaction in achieving this breadth was also clear: verbal corrections of dictation results were indeed necessary for perhaps 5% to 10% of the input words. To give only the most annoying example, Hello was once initially transcribed as Hollow. Here we see with painful clarity the limitations of an approach which substitutes interactive disambiguation for automatic knowledge-based disambiguation: even the most rudimentary discourse knowledge should have allowed the program to judge which word was more likely as a dialogue opener. On the other hand, the approach's capacity to compensate for lack of such knowledge was also clear: a verbal correction was quickly made, using facilities supplied by the dictation vendor.
It should be stressed that the SLT system of the CompuServe demos was by no means the first or only system to permit interactive monitoring of SR output before translation. As far back as the C-Star I international SLT demonstrations of 1993,[7] selection among SR candidates was essential for most participating systems. Similarly, selection among, or typed correction of, candidates is possible in most of the systems shown in the recent C-Star II demos of July 22, 1999.[8]
The CompuServe experiments were, however, the first to demonstrate that a broad-coverage SLT system of usable quality - a system capable of extending coverage beyond specialized domains toward unrestricted discourse - could be constructed by enabling users to correct economically the output of a broad-coverage SR component before passing the results to a broad-coverage MT component.
Ergonomic operation was an important element in the system's success. The SR correction facilities used in the experiments - the set of verbal revision commands supplied by the dictation product, including "scratch that", "correct (word)", etc. - were designed for general use in a competitive market, and thus of necessity show considerable attention to ergonomic issues. (By contrast, the SR components of other research systems continue to rely on typed correction or menu selection.) Of course, a smooth human interface between SR and MT cannot by itself yield broad coverage; what it can do is to permit the unexpected combination of SR and MT components developed separately, with broad coverage rather than SLT in mind.
This reliance on interactive correction raises obvious questions: Is the current amount and type of dictation correction tolerable for practical use? Would additional interaction for guiding or correcting translation be useful? Even if potentially useful, would it be tolerated, or would it break the camel's back?

2.2.1. Correction of Dictation


The interaction required in the current demos for correcting dictation is just that currently required for correcting text dictation in general. All current dictation products require interactive correction. The question is, do the advantages of dictation over typing nevertheless justify the cost of these products, plus the trouble of acquiring them, training them, and learning to use them? Their steadily increasing user base indicates that many users think so. (For the record, portions of this paper were produced using continuous dictation software.) My own impression is that, during the demos, continuous dictation with spoken corrections supplied correct text at least twice as fast as my own reasonably skilled typing would have done.
For readers who have never tried dictating, a description of the dictation correction process available in Seligman et al. (1998b) may help to realistically estimate the correction burden.
While a strict hands-off policy was adopted for the demos, it is worth noting that typed text and commands can be freely interspersed with spoken text and commands. It is sometimes handy, for instance, to select an error using the mouse, and then verbally apply any of the above-mentioned correction commands. Similarly, when spelling becomes necessary, typing often turns out to be faster than spoken spelling. Thus verbal input becomes one option among several, to be chosen when - as often happens - it offers the easiest or fastest path to the desired text. The question, then, is no longer whether to type or dictate the discourse as a whole, but which mode is most convenient for the input task immediately at hand. As broad-coverage SLT systems in the near term are likely to remain multi-modal rather than exclusively telephonic, they can and should take advantage of this flexibility.

2.2.2. Correction of Translation

The current demos were not intended to demonstrate the full range of interactive possibilities. In particular, while dictation results were corrected on-line as just discussed, there was no comparable attempt at interactive disambiguation of translation. Thus, when ambiguities occurred, the speaker had no way to control or check the translation results.
For example, when the English partner concluded one dialog by saying (3a), the French partner saw and heard (3b), which might be literally rendered as (3c).

(3) a. It was a pleasure working with you.
    b. C'était un plaisir fonctionner avec vous.
    c. It was a pleasure functioning with you.


The word work, in other words, had been translated as fonctionner, as would be appropriate for an input like (4).

(4) This program is not working.
Such translation errors were not disruptive during the demos: they were infrequent, and many of the errors which did appear might be more amusing than bothersome in the sort of informal socializing seen in most on-line chat today.
However, errors arising from lexical and structural ambiguities might well be more numerous and more disruptive in future, more sensitive chat-translation applications. Further, it seems doubtful that they can be eliminated in near-term systems aiming for both broad coverage and high quality, even assuming effective use of multiple knowledge sources like those described below. Thus my own guess is that interactive resolution of ambiguities during chat translation would in fact prove valuable. Feedback concerning the translation, via some form of back-translation, would probably prove useful as well. Again, for discussion of possible techniques, see Boitet (1996a) and Blanchon (1996).
But even granting that interactive correction could raise the quality of SLT, would users be willing to supply it? There is some indication that the degree of interaction now required in the demos to correct dictation may already be near the tolerable limit for chat as it is presently used (Flanagan, 1997). A healthy skepticism concerning the practicality of real-time translation correction is thus warranted. I suspect, however, that users' tolerance for interactive correction will turn out to depend on the application and the value of correct translation: to the extent that real-time MT can move beyond socializing into business, emergency, military, or other relatively crucial and sensitive applications, user tolerance for interaction can be expected to increase.
Ultimately, though, questions about the trade-off between the burden of interaction and its worth should be treated as topics for research: using a specified system, what level of quality is required for given applications (specified in terms of tasks to be accomplished within specified time limits), and what types and amounts of interaction are required on average to achieve that quality level? Clearly, until SLT systems with translation correction capabilities are built, no such experiments will be possible.

Having discussed the role of interactive disambiguation in SLT, and having described two experiments with highly interactive SLT, we now turn on our heels as forecast toward more integration-oriented studies. We begin with considerations of SLT system architecture.

3. System Architecture
An ideal architecture for "high road", or highly integrated, SLT systems would allow global coordination of, cooperation between, and feedback among, components (SR, analysis, transfer, etc.), thus moving away from linear or pipeline arrangements. For instance, SR, as it moves through an utterance, should be able to benefit from preliminary analysis results for segments earlier in the utterance. The architecture should also be modular, so that a variety of configurations can be tried: it should be possible, for instance, to exchange competing SR components; and it should be possible to combine components not explicitly intended for work together, even if these are written in different languages or running on different machines.
Blackboard architectures have been proposed (Erman & Lesser, 1990) to permit cooperation among components. In such systems, all participating components read from and write to a central set of data structures, the blackboard. To share this common area, however, the components must all "speak a common (software) language". Modularity thus suffers, since it is difficult to assemble a system from components developed separately. Further, blackboard systems are widely seen as difficult to debug, since control is typically distributed, with each component determining independently when to act and what actions to take.
In order to maintain the cooperative benefits of a blackboard system while enhancing modularity and facilitating central coordination or control of components, Seligman and Boitet (1993) and Boitet and Seligman (1994) proposed and demonstrated a "whiteboard" architecture for SLT. As in the blackboard architecture, a central data structure is maintained which contains selected results of all components. However, the components do not access this whiteboard directly. Instead, only a privileged program called the Coordinator can read from it and write to it. Each component communicates with the Coordinator and the whiteboard via a go-between program called a Manager, which handles messages to and from the Coordinator in a set of mailbox files. Because files are used as data-holding areas in this way, components (and their managers) can be freely distributed across many machines.[9]
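
The mailbox mechanism can be made concrete with a brief sketch. The following Python fragment is illustrative only: the file layout, message format, and names are my assumptions, not details of the demonstrated system (whose whiteboard was in fact maintained in a Lisp-based language).

    import json, os, time

    class Manager:
        """Go-between for one component: translates between the component's
        native results and the whiteboard's reserved representation, and
        exchanges messages with the Coordinator through mailbox files."""

        def __init__(self, component, mailbox_dir):
            self.component = component
            self.inbox = os.path.join(mailbox_dir, component + ".in")
            self.outbox = os.path.join(mailbox_dir, component + ".out")

        def send(self, payload):
            # Append one message in the shared format; only the Coordinator
            # reads these and writes the results onto the whiteboard.
            msg = {"from": self.component, "time": time.time(), "payload": payload}
            with open(self.outbox, "a") as f:
                f.write(json.dumps(msg) + "\n")

        def receive(self):
            # Consume any pending requests from the Coordinator.
            if not os.path.exists(self.inbox):
                return []
            with open(self.inbox) as f:
                msgs = [json.loads(line) for line in f if line.strip()]
            os.remove(self.inbox)
            return msgs

Because the mailboxes are ordinary files, the component and the Coordinator need not share a process, a machine, or an implementation language.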
Managers are not only mailmen, but interpreters: they translate between the reserved language of the whiteboard and the native languages of the components, which are thus free to differ. In our demo, the whiteboard was maintained in a commercial Lisp-based object-oriented language, while components included independently-developed SR, analysis, and word-lookup components written in C. Overall, the whiteboard architecture can be seen as an adaptation of blackboard architectures for client-server operations: the Coordinator becomes the main client for several components behaving as servers.
Since the Coordinator surveys the whiteboard, in which are assembled the selected results of all components, all represented in a single software interlingua, it is indeed well situated to provide central or global coordination. However, any degree of distributed control can also be achieved by providing appropriate programs alongside the Coordinator which represent the components from the whiteboard side. That is, to dilute the Coordinator's omnipotence, a number of demi-gods can be created. In one possible partly distributed control structure, the Coordinator would oversee a set of agendas, one or more for each component.
A closely related effort to create a modular "agent-based" (client-server-style) architecture with a central data structure, usable for many sorts of systems including SLT, is described by Julia et al. (1997). Lacking a central board but still aiming in a similar spirit for modularity in various sorts of translation applications is the project described by Zajac and Casper (1997). Further discussion of SLT architecture from the alternative viewpoint of the Verbmobil system appears in Gorz et al. (1996). For discussion of a recent DARPA initiative stressing modular switching of components for experimentation, see Aberdeen et al. (1999).

4. Data Structures
We have argued the desirability for system coordination of a central data structure where selected results of various components are assembled. The question remains how that data structure should be arranged. The ideal structure should clarify all of the relevant relationships, in particular clearing up the matter of representational "levels" - a confusing term with several competing interpretations. Boitet and Seligman (1994) presented several arguments for the use of interrelated lattices for maintaining components' results. Here I present one possible elaboration, suggesting a multi-dimensional set of structures in which three meanings of "level" are kept distinct (Figure 1).
We first distinguish an arbitrary number of "Stages of Translation", with each stage viewable as a long scroll of paper extending across our view from left to right. Left-right is the time dimension, with earlier elements on the left. The Stage 0 scroll represents the raw input to the SLT system, including for example the unprocessed speech input from both speakers and the record of one speaker's mouse clicks on an on-screen map, such as might be used for a direction-finding task. In its full extent from left to right, Stage 0 would thus include the raw input for a translation session once complete, for example, for a dialogue to be translated.
Stage 1 contains the results of the first stage of processing, whatever processes might be involved. This scroll, viewed as unrolling behind Stage 0, might for instance include twin sets of lattices representing the results of phoneme spotting within both speakers' raw input. Stages 2, 3, ..., n unroll in turn behind Stage 1, receding in depth. Stage 2 might include source-language syntactic trees; Stage 3 might include semantic structures derived from these trees; and so on, through MT transfer and generation to the final stage, a scroll behind all other scrolls, which might contain translated text annotated for speech synthesis. Pointers (diagonal light lines) would indicate relationships between elements in subsequent stages.

Figure 1. Multi-dimensional data structures for speech translation.
Each stage can be subdivided both vertically and horizontally. Vertical boundaries (vertical dashed lines) represent appropriate time or segment divisions, probably including utterances. Horizontal divisions represent "Tracks", since at each stage several separable signal sources may be under consideration. As already pointed out, Stage 1 in the figure includes two raw speech tracks and a track indicating mouse clicks on a map. Stage 2 might contain, in addition to tracks for the phoneme lattices already mentioned, other tracks (hidden from view) containing F0 curves extracted from the respective speech signals. Different stages may have different numbers of tracks, depending on the processes which define them.
Finally, within each track at a given stage, we can distinguish varying levels of "Height" on the page - that is, various values on the y-axis corresponding to given time values along the x-axis. These can be given various interpretations as appropriate for the type of track in question. When the track contains syntactic trees, height corresponds to syntactic rank, i.e., dominance, with dominant nodes usually covering longer time spans than dominated ones.
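
The three senses of "level" can also be kept apart in code. The sketch below (Python; all field names are illustrative, not from the paper) separates Stage (depth), Track (signal source), and Height (y-value within a track), with every element anchored to the shared time axis:

    from dataclasses import dataclass, field

    @dataclass
    class Element:
        start: float      # left edge on the shared time axis (x)
        end: float        # right edge
        height: int       # y-value within the track, e.g., syntactic rank
        label: str
        sources: list = field(default_factory=list)  # pointers to elements
                                                     # in earlier stages

    @dataclass
    class Track:          # one separable signal source within a stage
        name: str         # e.g., "speaker-A-speech", "mouse-clicks"
        elements: list = field(default_factory=list)

    @dataclass
    class Stage:          # one Stage of Translation (the depth dimension)
        index: int        # 0 = raw input; the final stage holds output text
        tracks: dict = field(default_factory=dict)   # track name -> Track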
Confusion regarding the meaning of "level" bedevils many discussions of MT: it sometimes means a stage of processing, sometimes a mode or type of information, and sometimes a gradation of dominance or span. The hope is that, by clearly distinguishing these meanings as stages, tracks, or height within tracks, we can help both programmers and programs keep their bearings amid a welter of information.
The multi-dimensional structures just described bear some resemblance to the three-dimensional charts of Barnett et al. (1990), used to track relationships between syntactic and semantic structures during analysis of queries to CYC knowledge bases (Lenat & Guha, 1990). They were developed independently, however. Three-dimensional charts were restricted to two depths or stages (syntactic and semantic), lacked tracks, and made no explicit reference to height or rank.
The whiteboard demo reported in Seligman and Boitet (1993) likewise made only partial use of the multi-dimensional structure: stages and height were explicitly represented and shown in the graphical user interface, with explicit representation of relations between structures in subsequent stages; but tracks were not yet included.

5. Interface between SR and MT Analysis


In a certain sense, SR and analysis for MT are comparable problems. Both require the recognition of the most probable sequences of elements. In SR, sequences of short speech segments must be recognized as phones, and sequences of phones must be recognized as words. In analysis, sequences of words must be recognized as phrases, sentences, and utterances.
Despite this similarity, current SLT systems use quite different techniques for phone, word, and syntactic recognition. Phone recognition is generally handled using Hidden Markov Models (HMMs); word recognition is often handled using Viterbi-style search for the best paths in phone lattices; and sentence recognition is handled through a variety of parsing techniques.
It can be argued that these differences are justified by differences of scale, perplexity, and meaningfulness. On the other hand, they introduce the need for interfaces between processing levels. The processors may thus become black boxes to each other, when seamless connection and easy communication might well be preferable. In particular, word recognition and syntactic analysis (of phrases, sentences, and utterances) should have a lot to say to each other: the probability of a word should depend on its place in the top-down context of surrounding words, just as the probability of a phrase or larger syntactic unit should depend on the bottom-up information of the words which it contains.
To integrate SR and analysis more tightly, it is possible to employ a single grammar for both processes, one whose terminals are phones and whose non-terminals are words, phrases, sentences, etc.[10] This phone-grounded strategy was used to good effect for example in the HMM-LR SR component of the ASURA SLT system (Morimoto et al., 1993), in which an LR parser extended a parse phone by phone and left to right while building a full syntactic tree.[11] The technique worked well for scripted examples. For spontaneous examples, however, performance was unsatisfactory, because of the gaps, repairs, and other noise common in spontaneous speech. To deal with such structural problems, an island-driven parsing style might well be preferable. An island-based chart parser, like that of Stock et al. (1989), would be a good candidate.
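
To make the single-grammar idea concrete, here is a toy fragment (symbols invented for illustration, in Python) in which terminals are phones and non-terminals are words and phrases, so that one grammar serves both recognition and analysis:

    # Toy phone-grounded grammar: rewriting rules whose terminals are
    # phones and whose non-terminals are words, phrases, etc.
    PHONE_GRAMMAR = {
        "NP":  [["DET", "N"]],
        "DET": [["dh", "ax"]],       # "the"
        "N":   [["k", "ae", "t"],    # "cat"
                ["d", "ao", "g"]],   # "dog"
    }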
However, chart initialization presents some technical problems. There is no difficulty in computing a lattice from spotted phones, given information regarding the maximum gap and overlap of phones. But it is not trivial to convert that lattice into a "chart" (i.e., a multi-path finite-state automaton) without introducing spurious extra paths. The author has implemented a Common Lisp program which does so correctly, based on an algorithm by C. Boitet (Seligman et al., 1998a). The algorithm tracks, for each node of an automaton under construction, the lattice arcs which it reflects and the lattice nodes at their origins and extremities. An extension of the procedure permits the inclusion of null, or epsilon, arcs in the output automaton. The method has been successfully applied to lattices derived from dictionaries, i.e., very large corpora of strings. (Full source code and pseudocode are available from the author.) Experiments with bottom-up island-driven chart parsing from charts initialized with phones are anticipated.
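
A deliberately naive Python sketch may clarify the problem. Here arc endpoints that fall within a join window are merged into shared chart nodes; this time-bucketing can introduce exactly the spurious extra paths described above, which the cited algorithm avoids by tracking, for each automaton node, the lattice arcs it reflects and their origin and extremity nodes. All parameter values are assumptions.

    def lattice_to_chart(arcs, join_window=0.03):
        """arcs: list of (label, t_start, t_end) spotted phones.
        Returns chart arcs (label, node_in, node_out) over integer node
        ids, joining arc endpoints that fall within join_window seconds."""
        times = sorted({t for _, s, e in arcs for t in (s, e)})
        nodes, cluster = [], []
        for t in times:                  # cluster nearby endpoint times
            if cluster and t - cluster[0] > join_window:
                nodes.append(sum(cluster) / len(cluster))
                cluster = []
            cluster.append(t)
        if cluster:
            nodes.append(sum(cluster) / len(cluster))
        nearest = lambda t: min(range(len(nodes)),
                                key=lambda i: abs(nodes[i] - t))
        return [(lab, nearest(s), nearest(e)) for lab, s, e in arcs]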

6. Use of Pauses for Segmentation


It is widely believed that prosody can prove crucial for SR and analysis of spontaneous speech.[12] Several aspects of prosody might be exploited: pitch contours, rhythm, volume modulation, etc. However, Seligman et al. (1997) proposed focusing on natural pauses as an aspect of prosody which is both important and relatively easy to detect automatically.[13]
Given the frequency of utterances in spontaneous speech which are not fully well-formed - which contain repairs, hesitations, and fragments - strategies for dividing and conquering utterances would be quite useful. The suggestion is that natural pauses can play a part in such a strategy: that "pause units", or segments within utterances bounded by natural pauses, can provide chunks which (a) are reliably shorter and less variable in length than entire utterances and (b) are relatively well-behaved internally from the syntactic viewpoint, though analysis of the relationships among them appears more problematic.
Our investigation began with transcriptions of four spontaneous Japanese dialogues concerning a simulated direction-finding task. The dialogues were carried out in the EMMI-ATR Environment for Multi-modal Interaction (Loken-Kim et al., 1993; Furukawa et al., 1993), two using telephone connections only, and two employing on-screen graphics and video as well. In each 3- to 7-minute dialogue, a caller pretending to be at Kyoto station received from a pre-trained "agent" directions to a conference center and/or hotel. In the multimedia setup, both the caller and agent could draw on on-screen maps and exchange typed information.
Morphologically tagged transcripts of the conversations were divided into turns by the transcriber, and included hesitation expressions and other natural speech features. We then added to the transcripts information concerning the placement and length of significant pauses. For our purposes, a significant pause was either a juncture of any length where breathing was clearly indicated (sometimes a bit less than 300 milliseconds) or a silence lasting approximately 400 milliseconds or more.
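
As a minimal sketch (Python, with an assumed input format), the criterion just stated amounts to the following test:

    def significant_pauses(junctures, min_silence=0.400):
        """junctures: list of (start, end, is_breath) silence intervals,
        times in seconds. A juncture counts as a significant pause if
        breathing is indicated (whatever its length) or if it lasts
        roughly 400 ms or more."""
        return [(s, e) for s, e, is_breath in junctures
                if is_breath or (e - s) >= min_silence]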
To facilitate pause tagging, we prepared a customized configuration of the Xwaves speech display program[14] so that it showed synchronized but separate speech tracks of both parties on screen (Figure 2). The pause tagger, referring to the transcript, could use the mouse to draw labeled lines through the tracks indicating the starts and ends of turns; the starts and ends of segments within turns; and the starts and ends of response syllables which occur during the other speaker's turn. Visual placement of labels was quite clear in most cases. As a secondary job, the tagger inserted a special character into a copy of the transcript text wherever pauses occurred within turns.

Figure 2. Interface used by the pause tagger.
After tagging, labels bearing exact timing information were downloaded to separate files. Because there should be a one-to-one mapping between labeled pauses within turns and marked pause locations in the transcript, it was then possible to create augmented transcripts by substituting accurate pause-length information into the transcripts at marked pause points.
In studying the augmented transcripts, four specific questions were addressed:
1. Are pause units reliably shorter than whole utterances? If they were not, they could hardly be useful in simplifying analysis. It was found, however, that in the corpus investigated, pause units are in fact about 60% the length of entire utterances, on the average, when measured in Japanese morphemes. The average length of pause units was 5.89 morphemes, as compared with 9.39 for whole utterances. Further, pause units are less variable in length than entire utterances: the standard deviation is 5.79 as compared with 12.97.
2. Would hesitations give even shorter, and thus perhaps even more manageable, segments if used as alternate or additional boundaries? The answer seems to be that because hesitations so often coincide with pause boundaries, the segments they mark out are nearly the same as the segments marked by pauses alone. No combination of expressions was found which gave segments as much as one morpheme shorter than pause units on average.
3. Is the syntax within pause units relatively manageable? A manual survey showed that, once hesitation expressions are filtered from them, some 90% of the pause units studied can be parsed using standard Japanese grammars; a variety of special problems appear in the remaining 10%.
4. Is translation of isolated pause units a possibility? We found that a majority of the pause units in four dialogues gave understandable translations into English when translated by hand.
The study provided encouragement for a "divide and conquer" analysis strategy, in which parsing and perhaps translation of pause units is carried out before, or even without, attempts to create coherent analyses of entire utterances.
As mentioned, parsability of spontaneous utterances might be enhanced by filtering hesitation expressions from them in preprocessing. Research on spotting techniques for such expressions would thus seem to be worthwhile. Researchers can exploit a speaker's tendency to lengthen hesitations, and to use them just before or after natural pauses.
Use of pause information for "dividing utterances into meaningful chunks" during SLT of Japanese is described by Takezawa et al. (1999). Pauses are used as segment boundaries in several commercial dictation products, but no descriptions are available.

7. Example-Based SLT
Example-based MT (EBMT) (Nagao, 1984; Sato, 1991) is translation by analogy. An EBMT system translates source-language sentences by reference to an "example base", or set of source-language utterances paired with their target-language equivalents. In developing such a system, the hope is to improve translation quality by reusing correct and idiomatic translations; to partly automate grammar development; and to gain insight into language learning.
Two EBMT systems are now being applied to SLT: the TDMT (Transfer-driven MT) system developed at ATR (Furuse & Iida, 1996; Iida et al., 1996; Sumita & Iida, 1992), used in the ATR-Matrix SLT system (Takezawa et al., 1999); and the PanEBMT system (Brown, 1996) of CMU, used along with transfer-based MT within the multi-engine MT architecture in the Diplomat SLT system (Frederking et al., 1997).
Despite their common aims, the two systems differ substantially. The ATR system aims to supply a complete translation single-handed, and accordingly includes a full parser for utterances and a hand-built grammar (set of language patterns) to go with it. The CMU system, by contrast, operates as a component of a larger system: in general, its aim is to supply possible partial translations, or translation chunks, to be placed on a chart along with chunks supplied by other translation engines.[15] For this mission, the system requires neither parser nor grammar, relying instead on heuristics to align sub-elements of sentences in the example base at training time. Once it has put its chunks in place during translation, a separate process, belonging to the Multi-Engine MT architecture, will employ a statistical language model to select the best path through the pre-stocked chart in order to assemble the final output.
As the suggestions below relate to a tree-oriented and end-to-end view of example-based processing, the primary concern will be with systems of the ATR type. We begin with a sketch of this methodology.
Consider the Japanese noun phrase (5a). Its literal translation might be (5b), but a more graceful translation would be (5c) or (5d).

(5) a. kyoto no kaigi
    b. conference of Kyoto
    c. conference in Kyoto
    d. Kyoto conference
We could hope to provide such improved translations if we had an example base showing for instance that (6a) had been translated as (6b) or (6c), and that (7a) had been rendered as (7b) or (7c).

(6) a. tokyo no kaigi
    b. conference in Tokyo
    c. Tokyo conference

(7) a. nyu yoku no kaigi
    b. conference in New York
    c. New York conference
The strategy would be to recognize a close similarity between the new input (5a) and these previously translated noun phrases, based on the semantic similarity between kyoto on one hand and tokyo and nyu yoku on the other. The same sort of pattern matching could be performed against a noun phrase in the example base differing from the input at more than one point, for example (8), where miitingu ('meeting') is semantically similar to kaigi ('conference').

(8) tokyo no miitingu
    'meeting in Tokyo'

At any number of such comparison points, semantic similarity of the relevant expressions can be assessed by reference to a semantic hierarchy - for example, a type hierarchy of semantic tags supplied by a thesaurus. A thesaurus associates a lexical item like kaigi with one or more semantic tags (e.g., CITY, SOCIAL-EVENT); and the similarity of two semantic tags can be defined as the distance one must rise in the relevant semantic hierarchy to reach a node which dominates both tags: the further, the more semantically distant. The four-level hierarchy of the Kadokawa Ruigo Shin-jiten (Ohno & Hamanishi, 1981) has been used in this way in several studies.
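
A minimal sketch of this distance measure, using an invented toy hierarchy in place of the Kadokawa one (one possible reading of "rise to a dominating node", counting upward steps from the first tag):

    # child -> parent links of a tiny illustrative tag hierarchy
    PARENT = {
        "CITY": "PLACE", "PLACE": "THING",
        "SOCIAL-EVENT": "EVENT", "EVENT": "THING",
    }

    def ancestors(tag):
        chain = [tag]
        while chain[-1] in PARENT:
            chain.append(PARENT[chain[-1]])
        return chain                  # tag, its parent, ..., root

    def tag_distance(a, b):
        """Levels one must rise from `a` to reach a node dominating both
        tags: the further up, the more semantically distant."""
        dominators_of_b = set(ancestors(b))
        for steps, node in enumerate(ancestors(a)):
            if node in dominators_of_b:
                return steps
        return float("inf")           # no common dominating node

Here tag_distance("CITY", "CITY") is 0, while tag_distance("CITY", "SOCIAL-EVENT") is 2, since one must rise two levels to reach THING.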
Different translations of the Japanese genitive no construction, for example as the English possessive (tanaka-san no kuruma, 'Tanaka's car'), would be distinguished by the distinct semantic types of their respective comparison points - in this case, for example, PERSON and VEHICLE.
By replacing each comparison point in an expression like (5a) with a variable, we can obtain a pattern like (9a). Such patterns can be embedded, giving (9b) or (9c).

(9) a. [?X no ?Y]
    b. [[?X no ?Y] no ?Z]
    c. [?X no [?Y no ?Z]]

If we then receive an input like (10) we can determine which bracketing is most sensible - that is, we can parse the input - by extending the techniques already discussed for gauging semantic similarity.

(10) kyoto no kaigi no ronbun
     Kyoto-GEN conference-GEN paper
     'paper at the conference in Kyoto'

One possibility is to designate a "head" for each pattern, and to posit that a pattern's overall semantic type is the type of its head. Then semantic similarity scores can be calculated between an input like (10) and an entire set of embedded patterns - that is, an entire pattern tree - by propagating similarity scores outward (upward). One can calculate similarity scores for several possible bracketings (trees), and choose the bracketing most semantically similar to the input. In this way, the calculation of semantic similarity guides structural disambiguation during analysis.
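
The outward propagation can be sketched as a recursive score over matched trees. The scheme below follows the text (a pattern's type is its head's type; scores propagate upward), but the multiplicative combination and the tag_sim function are my assumptions:

    def tree_similarity(pattern, tree, tag_sim):
        """pattern, tree: (tag, children) pairs, with leaves (tag, []).
        tag_sim maps two tags to a score in [0, 1]; the tag of a complex
        pattern is taken to be the semantic type of its head."""
        (p_tag, p_kids), (t_tag, t_kids) = pattern, tree
        score = tag_sim(p_tag, t_tag)
        if len(p_kids) != len(t_kids):
            return 0.0
        for p_kid, t_kid in zip(p_kids, t_kids):
            score *= tree_similarity(p_kid, t_kid, tag_sim)  # propagate upward
        return score

    # Structural disambiguation, e.g., choosing between (9b) and (9c):
    # best = max(candidate_trees,
    #            key=lambda t: tree_similarity(t, input_tree, tag_sim))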
Having outlined the essentials of example-based processing in the tree-oriented style, we are now ready to discuss possible elaborations. The first involves the degree of separation between stages of EBMT.

7.1. SEPARATION OF EXAMPLE-BASED ANALYSIS, TRANSFER, AND GENERATION

Recall that semantic similarity calculation can be used to select an embedded set of patterns (a parse tree) from among several competitors. If each source-language pattern (i.e., subtree) is associated with a unique target-language pattern which provides its translation, then the selection of a complete source-language tree will simultaneously and automatically provide a corresponding target-language tree. In this way, an example-based analysis process can be made to provide automatically a transfer process as well - that is, a mapping of source-language structures into target-language structures. TDMT intentionally combines analysis and transfer in this way. The combination is seen as an advantage: the same mechanism which handles structural disambiguation simultaneously selects the right translation from among several candidates. However, the combination of phases does raise issues concerning the role of transfer in handling translation ambiguity and structural mismatches.
First, some translation applications may require an explicit account of translation ambiguity - that is, of the possibility of translating a given subtree or node in more than one way. For such applications, transfer might be treated as a separate phase of translation from source-language parsing. That is, since considerations of semantic similarity can guide the selection of target structure - just as they can guide the choice of analysis tree - we can recognize the possibility of example-based transfer as separate from example-based analysis. Furthermore, depending on the depth of analysis, even once a target-language tree has been selected, ambiguity may arise in selecting target-language surface forms to express it. Thus a separate example-based generation phase also becomes a possibility.
A second issue relates to structural mismatches between source and target. Should they be handled in the transfer phase of translation? Consider the translations in (11)-(13), for example.

(11) Zo wa, hana ga nagai.
     ELEPHANT topic NOSE subj BE-LONG
     Elephants have long noses.

(12) Taeko wa, kaminoke ga nagai.
     (NAME) topic HAIR subj BE-LONG
     Taeko's hair is long.
     Taeko has long hair.

(13) Watashi wa, taeko ga suki desu.
     ME topic (NAME) subj IS-LOVED
     I like/love Taeko.
In these cases, language-internal considerations dictate non-flat analyses on both the source and target sides. However, in each case, the source tree is differently configured from the target tree. Thus, to represent the correspondences completely, it is insufficient simply to map one source node (one source pattern) onto one target node (one target pattern); rather, we need to intermap arbitrary subtree configurations (embedded pattern sets). In current implementations of tree-oriented EBMT, such general mappings between subtrees are not supported during transfer; rather, they are handled by special-purpose post-processing routines. It might prove easier to arrange a more general treatment for such intermappings if transfer were treated as a separate translation phase.
An experiment reported by Sobashima and Iida (1995) and Sobashima and Seligman (1994) takes a first step toward clear separation of EBMT phases: it presents an example-based treatment of analysis only. (Further information is given below.) A distinct example-based transfer phase including facilities for intermapping embedded patterns was envisaged, but has not yet been implemented.

7.2. MULTIPLE DIMENSIONS OF SIMILARITY

So far we have discussed the measurement of similarity along the semantic scale only. But utterances and structures can be compared along other dimensions as well. Thus for example, when assessing the similarity between a given pattern and the input pattern to be translated, we could ask not only how semantically similar its elements are to those of the input pattern, but how syntactically similar as well, or how graphologically or phonologically similar.
Sobashima and Iida (1995) and Sobashima and Seligman (1994) describe facilities for measuring and combining several sorts of similarity. Syntactic similarity, for instance, is measured with reference to a syntactic ontology, comparable to the thesaurus-based semantic hierarchy discussed above; and a score indicating overall similarity of respective variable elements in two patterns is calculated by combining syntactic and semantic similarity scores. The reported implementation also considered, as a factor in overall similarity, a score indicating graphological similarity: 1 for a complete match, and 0 in other cases. Future versions, however, might instead measure phonological similarity - for instance, by means of a phone-type ontology indicating, for example, that [ʃ] and [tʃ] are similar sounds, while [ʃ] and [k] are more different. Below, we briefly indicate how multiple similarity dimensions entered into the calculation of overall similarity.
Once we recognize the possibility of considering phonological similarity as a factor in overall similarity between patterns, we move EBMT beyond text translation into the area of SLT. We could, for instance, attempt to disambiguate the speech act of an utterance by comparing the prosodic contours of its elements with the contours of elements of labeled utterances in a database. Such prosodic comparisons might help, for example, to distinguish politely hesitant statements and yes-no questions in Japanese. These utterance types are syntactically marked by final particles ga and ka, which are phonologically quite difficult to distinguish; their prosodies, however, tend to be quite distinct.

In any case, use of similarity measurements along multiple dimensions as an aid to disambiguation would be very much in the spirit of the "high road", or integrative, approach to SLT discussed throughout.[16]

7.3. BOTH TOP-DOWN AND BOTTOM-UP

In most current example-based systems, the applicability of a pattern is judged by the semantic match of its sub-elements against those of the input. These are bottom-up similarity judgments: the sub-elements provide evidence for the presence of the pattern as a whole. Usually absent, however, are corresponding top-down similarity judgments whereby the patterns give evidence for the sub-elements. Sobashima and Iida (1995) and Sobashima and Seligman (1994), however, do demonstrate application of both bottom-up and top-down similarity constraints. Further, similarity is measured in both directions along several dimensions (syntactic, semantic, and others), as suggested above. We now briefly describe the method.
First, some necessary background. Consider a "linguistic expression", which may be either atomic or complex. Complex expressions are composed of variables and/or fixed lexical elements, as in (9a) above. We calculate the "elemental similarity", or E-Sim, of two expressions as a combined function of their syntactic, semantic, and phonological or graphological similarities.[17]
Now we are ready to consider top-down vs. bottom-up similarity measurement. We calculate the "structural similarity", or "bottom-up similarity", of two complex expressions by combining the elemental similarities of their respective elements. By contrast, the top-down factor in the similarity of two expressions A and B is a measure of the similarity of their respective contexts. We call this factor the "contextual similarity" C-Sim of expressions A and B, and calculate it as the sum of the elemental similarities of their respective left and right neighbor expressions L and R (14).

(14) C-Sim(A, B) = E-Sim(L_A, L_B) + E-Sim(R_A, R_B)


The final, or integrated, similarity score Sim for expressions S1 and S2, then, is the combination of their structural (bottom-up) similarity and their contextual (top-down) similarity (15).

(15) Sim(S1, S2) = S-Sim(S1, S2) * C-Sim(S1, S2)
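
Read as code, (14) and (15) combine as follows (a sketch only; the equal weighting of the dimensions inside E-Sim is my assumption, since the combination function is left open in the cited papers):

    def e_sim(x, y, syn, sem, phon):
        # Elemental similarity: combined syntactic, semantic, and
        # phonological/graphological similarity of two expressions.
        return (syn(x, y) + sem(x, y) + phon(x, y)) / 3.0

    def s_sim(xs, ys, **dims):
        # Structural (bottom-up) similarity: combine the elemental
        # similarities of the respective sub-elements.
        if not xs or len(xs) != len(ys):
            return 0.0
        return sum(e_sim(x, y, **dims) for x, y in zip(xs, ys)) / len(xs)

    def c_sim(ctx_a, ctx_b, **dims):
        # Contextual (top-down) similarity, as in (14); ctx = (left, right).
        (la, ra), (lb, rb) = ctx_a, ctx_b
        return e_sim(la, lb, **dims) + e_sim(ra, rb, **dims)

    def sim(a, b, ctx_a, ctx_b, **dims):
        # Integrated similarity, as in (15): structural times contextual.
        return s_sim(a, b, **dims) * c_sim(ctx_a, ctx_b, **dims)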

We have seen that Sim incorporates multi-dimensional similarity measurements applied both top-down and bottom-up. The next question is how to apply this score for example-based analysis. We now outline the method proposed in the cited papers.

7.4. ANALYSIS WITH MULTI-DIMENSIONAL, TOP-DOWN AND BOTTOM-UP SIMILARITY MEASUREMENTS

We can consider the training stage first. In this stage, an example base is prepared by bracketing and labeling the training corpus by hand. The labeling entry for a complex expression includes the number of elements it contains; the set of syntactic, semantic, and other classifying features of the complex structure as a whole; the classifying features of each sub-element; and the classifying features of the left and right contexts.
Now on to the analysis itself. After morphological processing, with access to a lexicon giving classifying feature sets (perhaps multiple sets) for each terminal, the main routine proceeds as follows:
1. Search the example base for the expression most similar to any contiguous subsequence in the input: find the longest similar matches from position 1 in the input, then from position 2, and so on, terminating if a perfect match is found.
2. Reduce, or rewrite, the covered subsequence, passing its similarity features to the rewritten structure.
3. Go to 1. Continue the cycle until no further reduction is possible (see the sketch following this list).
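
A schematic Python version of this loop; the example-base interface, the acceptance threshold, and the assumption that the base contains no cyclic unary rewrites are all mine, and sim here is assumed to fold the contextual comparison of Section 7.3 into a single call:

    THRESHOLD = 0.5     # minimum acceptable match score (an assumed value)

    def analyze(tokens, example_base, sim):
        """tokens: terminal feature structures from morphological
        processing. Repeatedly rewrite the best-matching contiguous
        subsequence until no further reduction is possible."""
        reduced = True
        while reduced:
            reduced = False
            best = None                      # (score, start, end, example)
            for start in range(len(tokens)):
                for end in range(len(tokens), start, -1):   # longest first
                    for ex in example_base:
                        s = sim(tokens[start:end], ex)
                        if best is None or s > best[0]:
                            best = (s, start, end, ex)
            if best and best[0] >= THRESHOLD:
                _, start, end, ex = best
                # Rewrite the covered span, passing its similarity
                # features to the new structure (assumes no unary cycles).
                tokens = (tokens[:start]
                          + [ex.rewrite(tokens[start:end])]
                          + tokens[end:])
                reduced = True
        return tokens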
A preliminary experiment was conducted on 132 English and 129 Japanese sentences. This corpus was too small to permit meaningful statistical evaluation, but we can say that numerous sentences were successfully analyzed which might have yielded massive structural ambiguity, for example (16).

(16) However, we do have single rooms with a shower for eighty dollars a night and twin rooms with a bath for a hundred and forty dollars a night.

Here, many spurious combinations, e.g., shower for eighty dollars a night and twin rooms, were ignored in favor of the proper interpretations. Successful analysis of various uses of the article a was particularly notable. A full trace appears in the cited papers.

7.5. SIMILARITY VS. FREQUENCY

We have been discussing the uses of similarity calculations for the resolution of various sorts of ambiguity. We conclude this section by contrasting similarity-based disambiguation and probability-based disambiguation, an approach which is more widely studied at present. Several current parsers (e.g., Black et al., 1993) are trained to resolve conflicts among competing analyses by using information about the relative frequencies, and thus probabilities, of the combinations of elements in question. At short range, n-gram statistics are used; at longer ranges, stochastic rules.
Several of the considerations raised above with respect to similarity-based disambiguation apply equally to probability-based disambiguation. For example, Jurafsky (1993) stresses the need for multidimensional processing: in his parser - based upon the theory of grammatical constructions (Fillmore et al., 1988; Kay, 1990) and claimed to model several features of human parsing as observed in psycholinguistic experiments - semantic as well as syntactic frequencies and probabilities are brought to bear in selecting the proper parse. Also stressed is the need for both top-down and bottom-up statistics in evaluating parse trees as a whole.
Ideally, disambiguation approaches based upon similarity and approaches based upon occurrence probability should complement each other. However, I am aware of no attempts to combine the two.

8. Cue-based Speech Acts


Speech-act analysis (Searle, 1969) - analysis in terms of illocutionary acts like INFORM, WH-QUESTION, REQUEST, and so on - can be useful for SLT in numerous ways. Six uses, three related to translation and three to speech processing, will be mentioned here. Concerning translation, the following tasks must be performed:
1. Identify the speech acts of the current utterance. Speech-act analysis of the current utterance is necessary for translation. For instance, the English pattern can you (VP, bare infinitive)? may express either an ACTION-REQUEST or a YN-QUESTION (yes-no question). Resolution of this ambiguity will be crucial for translation.
2. Identify related utterances. Utterances in dialogues are often closely related: for instance, one utterance may be a prompt and another utterance may be its response; and the proper translation of a response often depends on identification and analysis of its prompt. For example, Japanese hai can be translated as yes if it is the response to a YN-QUESTION, but as all right if it is the response to an ACTION-REQUEST. Further, the syntax of a prompt may become a factor in the final translation. Thus, in a responding utterance hai, so desu (lit. 'yes, that's right'), the segment so desu may be most naturally translated as he can, you will, she does, etc., depending on the structure and content of the prompting question. The recognition of such prompt-response relationships will require analysis of typical speech-act sequences.
3. Analyze relationships among segments and fragments. Early processing of utterances may yield fragments which must later be assembled to form the global interpretation for an utterance. Speech-act sequence analysis should help fit fragments together, since we hope to learn about typical act groupings.
Concerning speech processing, it is necessary to do the following:
4. Predict speech acts to aid SR. If we can predict the coming speech acts, we can partly predict their surface patterns. This prediction can be used to constrain SR. As already mentioned, for instance, Japanese utterances ending in ka and ga - respectively, YN-QUESTIONs and INFORMs - are difficult to distinguish phonologically. We earlier considered the use of prosodic information in resolving this uncertainty. Predictions as to the relative likelihood of these speech acts in a given context should further aid recognition.
5. Provide conventions for prosody recognition. Once spontaneous data is labeled, SR researchers can try to recognize prosodic cues to aid in speech-act recognition and disambiguation. For instance, they can try to distinguish segments expressing INFORMs and YN-QUESTIONs according to the F0 curves associated with them - a distinction which would be especially useful for recognizing YN-QUESTIONs with no morphological or syntactic markings.
6. Provide conventions for speech synthesis. Similarly, speech synthesis researchers can try to provide more natural prosody by exploiting speech-act information. Once relations between prosody and speech acts have been extracted from corpora labeled with speech-act information, researchers can attempt to supply natural prosody for synthesized utterances according to the specified speech acts. For instance, more natural pronunciations can be attempted for YN-QUESTIONs, or for CONFIRMATION-QUESTIONs (including tag questions in English, as in (17)).

(17) The train goes east, doesn't it?

While a well-founded set of speech-act labels would be useful, it has not been clear what the theoretical foundation should be. As a result, no speech-act set has yet become standard, despite considerable recent effort.18 Labels are still proposed intuitively, or by trial and error.
Speakers' goals can certainly be analyzed in many ways. However, Seligman et al. (1995) hypothesize that only a limited set of goals is conventionally expressed in a given language. For just these goals, relatively fixed expressive patterns are learned by speakers when they learn the language. In English, for instance, it is conventional to express certain suggestions or invitations using the patterns Let's V or Shall we V? In Japanese, one conventionally expresses similar goals via the patterns V[combining stem] masho or V[combining stem] masen ka?
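By way of illustration only, a cue-based recognizer could be as simple as matching token prefixes against an inventory of patterns. The patterns and labels below are invented for the example; a real inventory would come from the discovery procedure described below.

(defparameter *cue-patterns*
  ;; Hypothetical inventory of (cue-token-sequence . CBSA-label) pairs.
  '((("let's")      . SUGGESTION)
    (("shall" "we") . SUGGESTION)
    (("can" "you")  . ACTION-REQUEST)))  ; or YN-QUESTION; cf. task 1 above

(defun match-cbsa (tokens)
  "Return the label of the first cue pattern that prefixes TOKENS, if any."
  (loop for (cue . label) in *cue-patterns*
        when (and (>= (length tokens) (length cue))
                  (every #'string-equal cue (subseq tokens 0 (length cue))))
          return label))

For instance, (match-cbsa '("shall" "we" "meet" "at" "three")) would return SUGGESTION under this toy inventory.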
The proposal is to focus on discovery and exploitation of these conventionally expressible speech acts, or "Cue-based Speech Acts" (CBSAs).19 The relevant expressive patterns and the contexts within which they are found have the great virtue of being objectively observable; and assuming the use of these patterns is common to all native speakers, it should be possible to reach a consensus classification of the patterns according to their contextualized meaning and use. This functional classification should yield a set of language-specific speech-act labels which can help to put speech-act analysis for SLT on a firmer foundation.
The first reason to analyze speech acts in terms of observable linguistic patterns, then, is the measure of objectivity thus gained: the discovery process is to some degree empirical, data-driven, or corpus-based. A second reason is that automated cue-based analysis, being shallow or surface-bound, should be relatively quick as opposed to plan-based analysis. Plan-based analysis may well prove necessary for certain purposes, but it is quite expensive. For applications like SLT which must be carried out in nearly real time, it seems wise to exploit shallow analysis as far as possible.
With these advantages of cue-based processing - empirical grounding and speed - come certain limitations. When analyzing in terms of CBSAs, we cannot expect to recognize all communicative goals. Instead, we restrict our attention to communicative goals which can be expressed using conventional linguistic cue patterns. Communicative goals which cannot be described as CBSAs include utterance goals which are expressed non-conventionally (compare the non-conventional warning (18a) to the conventional WARNING (18b)); or goals which are expressed only implicitly ((19) as an implicit request to shut the window); or goals which can only be defined in terms of relations between utterances.20

(18)a. May I call your attention to a potentially dangerous dog?

b. Look out for the dog!

(19) It's cold outside.

Given that the aim is to classify expressive patterns according to their meaning and function, how should this be done? Seligman (1991) and Seligman et al. (1995) describe a paraphrase-based approach: native speakers are polled as to the essential equivalence of expressive patterns in specified discourse contexts. If by consensus several patterns can yield paraphrases which are judged equivalent in context, and if the resulting pattern set is not identical to any competing pattern set, then it can be considered to define a CBSA. (Knott and Dale (1995) and Knott (1996) describe a similar substitution-based approach to the discovery of discourse relations, as opposed to speech acts.)
CBSAs are defined in terms of monolingual conventions for expressing certain communicative goals using certain cue patterns. For translation purposes, however, it will be necessary to compare the conventions in one language with those in the other. With this goal in mind, the discovery procedure was applied to twin corpora of Japanese-Japanese and English-English spontaneous dialogues concerning transportation directions and hotel accommodation (Loken-Kim et al., 1993). CBSAs were first identified according to monolingual criteria. Then, by observing translation relations among the English and Japanese cue patterns, the resulting English and Japanese CBSAs were compared. Interestingly, it was found that most of the proposed CBSAs seem valid for both English and Japanese: only two out of 27 seem to be monolingual for the corpus in question.
We have been outlining a cue-based approach to recognition of speech or discourse acts, with the assumption that some sort of parsing would be employed to recognize cue patterns. This methodology can be compared with statistical recognition approaches: speech- or discourse-act labels are posited in advance, and statistical models are subsequently built which attempt to identify the acts according to their sequence (Reithinger, 1995; Nagata & Morimoto, 1993) or according to the words they contain (Alexandersson et al., 1997; Reithinger & Klesen, 1997).
Certain speech-act sequences may indeed turn out to be typical; and certain words may indeed prove to be unusually common in, and thus symptomatic of, arbitrarily defined speech acts. Thus statistical techniques are indeed likely to be helpful for recognition of conventional speech acts when they are implied or expressed non-conventionally, or for recognition of speech acts which are not conventional but nevertheless appear useful for some applications. Further, even for conventional speech acts which are conventionally expressed, efficiency considerations may sometimes favor statistical recognition techniques over pattern recognition: once CBSAs were identified using our methods and a sufficiently large training corpus had been hand-labeled, statistical models might certainly be built to permit efficient identification in context. However, statistical recognition approaches alone cannot provide a principled way to discover (that is, posit or hypothesize) the labels in the first place, and this is what we seek. The current CBSA set has been applied in three studies: Black (1997) attempted to associate speech acts, including CBSAs, with intonation contours in hopes of improving speech synthesis; Iwadera et al. (1995) employed the CBSA set in attempts to parse discourse structure; and Jokinen et al. (1998) used CBSAs in topic-tracking experiments.

9. Tracking Lexical Co-occurrences


In the processing of spontaneous language, the need for predictions at the morphological or lexical level is clear. For bottom-up parsing based on phones or syllables, the number of lexical candidates is explosive. It is crucial to predict which morphological or lexical items are likely so that candidates can be weighted appropriately. (Compare such lexical prediction with the predictions from CBSAs discussed above. In general, it is hoped that by predicting CBSAs we can in turn predict the structural elements of their cue patterns. We are now shifting the discussion to the prediction of open-class elements instead. The hope is that the two sorts of prediction will prove complementary.)
N-grams provide such predictions only at very short ranges. To support bottom-up parsing of noisy material containing gaps and fragments, longer-range predictions are needed as well. Some researchers have proposed investigation of associations beyond the n-gram range, but the proposed associations remain relatively short-range (about five words). While stochastic grammars can provide somewhat longer-range predictions than n-grams, they predict only within utterances. Our interest, however, extends to predictions on the scale of several utterances.
Thus Seligman (1994a) and Seligman et al. (1999) propose to permit the definition of windows in a transcribed corpus within which co-occurrences of morphological or lexical elements can be examined. A flexible set of facilities (CO-OC) has been implemented in Common Lisp to aid collection of such discourse-range co-occurrence information and to provide quick access to the statistics for on-line use.
A window is defined as a sequence of minimal segments, where a segment is typically a turn, but can also be a block delimited by suitable markers in the transcript.
Sparse data is somewhat less problematic for long-range than for short-range predictions, since it is in general easier to predict what is coming "soon" than what is coming next. Even so, there is never quite enough data; so smoothing will remain important. CO-OC can support various statistical smoothing measures. However, since these techniques are likely to remain insufficient, a new technique for semantic smoothing is proposed and supported: researchers can track co-occurrences of semantic tokens associated with words or morphs in addition to co-occurrences of the words or morphs themselves. The semantic tokens are obtained from standard on-line thesauri. The benefits of such semantic smoothing appear especially in the possibility of retrieving reasonable semantically-mediated associations for morphs which are rare or absent in a training corpus.
Sections 9.1-9.3 describe CO-OC's operations in somewhat greater detail. Section 9.4 sketches possible applications for the co-occurrence information harvested by the program.

9.1. WINDOWS AND CONDITIONAL PROBABILITIES

As mentioned, we first permit the investigator to define minimal segments within the corpus: these may be utterances, sections bounded by pauses or significant morphemes such as conjunctions, hesitations, postpositions, and so on. Windows composed of several successive minimal segments can then be recognized: let S_i be the current segment and N be the number of additional segments in the window as it extends to the right. N = 2 would, for instance, give a window three segments long with S_i as its first segment. Then if a given word or morpheme M_1 occurs (at least once) in the initial segment, S_i, we attempt to predict the other words or morphemes which will co-occur (at least once) anywhere in the window. Specifically, a conditional probability Q can be defined as in (20),
(20) $Q(M_1, M_2) = P(M_2 \in S_i \cup S_{i+1} \cup S_{i+2} \cup \cdots \cup S_{i+N} \mid M_1 \in S_i)$
where M_j are morphemes, S_j are minimal segments, and N is the width of the window in segments. Q is thus the conditional probability that M_2 is an element of the union of segments S_i, S_{i+1}, S_{i+2}, and so on up to S_{i+N}, given that M_1 is an element of S_i. Both the segment definition and the number of segments in a window can be adjusted to vary the range over which co-occurrence predictions are attempted.
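A minimal sketch of how Q might be estimated from a segmented transcript follows, assuming segments are represented simply as lists of morph strings; the function name and representation are illustrative, not CO-OC's actual code.

(defun window-q (m1 m2 segments n)
  "Estimate Q(M1, M2): the probability that M2 occurs somewhere in a
window of N+1 segments, given that M1 occurs in the window's first
segment.  SEGMENTS is a list of minimal segments (lists of morphs)."
  (loop with hits = 0 and trials = 0
        for rest on segments
        for window = (subseq rest 0 (min (1+ n) (length rest)))
        when (member m1 (first rest) :test #'string-equal)
          do (incf trials)
             (when (some (lambda (seg) (member m2 seg :test #'string-equal))
                         window)
               (incf hits))
        finally (return (if (zerop trials) 0 (/ hits trials)))))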
For initial experiments, we used a morphologically tagged corpus of 16 spontaneous Japanese dialogues concerning direction-finding and hotel arrangements (Loken-Kim et al., 1993). We collected common-noun/common-noun, common-noun/verb, verb/common-noun, and verb/verb conditional probabilities in a three-segment window (N=2). Conditional probability Q was computed among all morph pairs for these classes and stored in a database; pairs scoring below a threshold (0.1 for the initial experiments) were discarded. We also computed and stored the mutual information for each morph pair, using the standard definition as in Fano (1961).
Fast queries of the database are then enabled. A central function is GET-MORPH-WINDOW-MATES, which provides all the window mates for a specified morph which belong to a specified class and have scores above a specified threshold for the specified co-occurrence measure (conditional probability or mutual information).
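In spirit, such a query might look like the sketch below, written against an assumed hash-table database rather than CO-OC's actual interface; the record layout is invented for the example.

(defun get-morph-window-mates (morph db &key class (measure :q) (threshold 0.1))
  "Return (mate . score) pairs for MORPH above THRESHOLD, filtered by CLASS.
DB is assumed to map each morph to records of the form
(mate mate-class q mutual-information)."
  (loop for (mate mate-class q mi) in (gethash morph db)
        for score = (ecase measure (:q q) (:mi mi))
        when (and (or (null class) (eq mate-class class))
                  (> score threshold))
          collect (cons mate score)))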
The intent is to use such queries in real time to support bottom-up, island-driven SR and analysis. To support the establishment of island centers for such parsing, we also collect information on each corpus morph in isolation: its hit count and the segments it appears in, its unigram probability and probability of appearance in a given segment, etc. Once island hypotheses have been established based on this foundation, co-occurrence predictions will come into play for island extension.

9.2. SEMANTIC SMOOTHING

As mentioned, CO-OC supports the use of standard statistical techniques (Nadas, 1985) for smoothing both conditional probability and mutual information. In addition, however, we enable semantic smoothing in an innovative way. Thesaurus categories - "cats" for short - are sought for each corpus morph (and stored in a corpus-specific customized thesaurus for fast access). The common noun eki ('station'), for instance, has among others the cat label "725a" (representing a semantic class of POSTS-OR-STATIONS in the standard Kadokawa Japanese thesaurus (Ohno & Hamanishi, 1981)).
Equipped with such information, we can study the co-occurrence within windows of cats as well as morphs. For example, using N=2, GET-CAT-WINDOW-MATES finds 36 cats co-occurring with "725a", one of the cats associated with eki, with a conditional probability Q > 0.10, including "459a" (sewa 'taking care of' or 'looking after'), "216a" (henko 'transfer'), and "315b" (ori 'getting off'). Since we have prepared an indexed reverse thesaurus for our corpus, we can quickly find the corpus morphs which have these cat labels, respectively miru 'look', mieru 'can see/be visible', magaru 'turn', and oriru 'get off'. The resulting morphs are related to the input morph eki via semantic rather than morph-specific co-occurrence. They thus form a broader, smoothed group.
This semantic-smoothing procedure - morph to related cats, cats to co-occurring category window-mates, cats to related morphs - has been encapsulated in the function GET-MORPH-WINDOW-MATES-VIA-CATS. It permits filtering, so that morphs are output only if they belong to a desired morphological class and are mediated by cats whose co-occurrence likelihood is above a specified threshold.
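The pipeline can be pictured as in the following sketch. The three lookup tables (morph to cats, cat to scored cat window-mates, cat to morphs in the reverse thesaurus) are assumed to be hash tables built offline; the names are hypothetical, not CO-OC's actual interface.

(defun morph-window-mates-via-cats (morph cats-of cat-mates morphs-of
                                    &key (threshold 0.1))
  "Collect morphs related to MORPH via thesaurus categories:
MORPH -> its cats -> cats co-occurring above THRESHOLD -> their morphs."
  (let ((result '()))
    (dolist (cat (gethash morph cats-of) result)
      (loop for (mate . q) in (gethash cat cat-mates)
            when (> q threshold)
              do (dolist (m (gethash mate morphs-of))
                   (pushnew m result :test #'string-equal))))))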
Thesaurus categories are often arranged in a type hierarchy. In the Kadokawa thesaurus, there are four levels of specificity: "725a" (POSTS-OR-STATIONS), mentioned above, belongs to a more general category "725" (STATIONS-AND-HARBORS), which in turn belongs to "72" (INSTITUTIONS), which belongs to "7" (SOCIETY). Accordingly, we need not restrict co-occurrence investigation to cats at the level given by the thesaurus. Instead, knowing that "725a" occurred in a segment S_i, we can infer that all of its ancestor cats occurred there as well; and we can seek and record semantic co-occurrences at every level of specificity. This has been done; and GET-MORPH-WINDOW-MATES-VIA-CATS has a parameter permitting specification of the desired level of semantic smoothing. The more abstract the level of smoothing, the broader the resulting group of semantically-mediated morpheme co-occurrences. The most desirable level for semantic smoothing is a matter for future experimentation.

9.3. EVALUATION

We are presently reporting the implementation of facilities intended to enable many experiments concerning morphological and morpho-semantic co-occurrence; the experiments themselves remain for the future. Clearly, further testing is necessary to demonstrate the reliability and usefulness of the approach. A principal aim would be to determine how large the corpus must be before consistent co-occurrence predictions are obtained. Nevertheless, some indication of the basic usability of the data is in order.
Tools have been provided for comparing two corpora with respect to any of the fields in the records relating to morphs, morph co-occurrences, cats, or cat co-occurrences. Using these, we treated 15 of our dialogues as a training corpus, and the one remaining dialogue as a test corpus. We compared the two corpora in terms of conditional probabilities for morph co-occurrences. In both cases, statistically unsmoothed scores were used for simplicity of interpretation.
We found 5,162 co-occurrence pairs above a conditional probability threshold of 0.10 in the training corpus and 1,552 in the test. Since 509 pairs occurred in both corpora, the training corpus covered 509 out of 1,552, or 33%, of the test corpus. That is, one third of the morph co-occurrences with conditional probabilities above 0.10 in the test corpus were anticipated by the training corpus.
This coverage seems respectable, considering that the training corpus was small and that neither statistical nor semantic smoothing was used. More important than coverage, however, is the presence of numerous pairs for which good co-occurrence predictions were obtained. Such predictions differ from those made using n-grams in that they need not be chained, and thus need not cover the input to be useful: if consistently good co-occurrence predictions can be recognized, they can be exploited selectively.

The figures obtained for cats and cat co-occurrences are comparable.

9.4. POSSIBLE APPLICATIONS

A weighted co-occurrence between morphemes or lexemes can be viewed as an association between these items; so the set of co-occurrences which CO-OC discovers can be viewed as an associative or semantic network. Spreading activation within such networks is often proposed as a method of lexical disambiguation.21 Thus disambiguation becomes a second possible application of CO-OC's results, beyond the above-mentioned primary use for constraining SR.22

A third possible use is in the discovery of topic transitions: we can hypothesize that a span within a dialogue where few co-occurrence predictions are fulfilled is a topic boundary.23 Once the new topic is determined, appropriate constraints can be exploited, for example by selecting a relevant subgrammar.
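As a sketch of the heuristic (PREDICTION-HIT-RATE, which would score the fraction of co-occurrence predictions fulfilled at each segment, is assumed rather than defined here):

(defun topic-boundaries (segments &key (threshold 0.05))
  "Hypothesize a topic boundary wherever fulfilled co-occurrence
predictions become scarce.  Returns the indices of such segments."
  (loop for i from 0 below (length segments)
        when (< (prediction-hit-rate segments i) threshold)
          collect i))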

10. Translation Mismatches


During translation, when the source and target expressions contain differing amounts of information, a "translation mismatch" is said to occur. For example, the English sentence (21a) may be translated by Japanese (21b). In this case, because the explicit pronoun is suppressed, information concerning person and number is lost. Similarly, (22a) may be translated as (22b). Here, the pronoun is once again suppressed, and information about the object of the verb is lost as well: Japanese does not express either its number or its definiteness.

(21)a. He ate.
b. Tabemashita.
   EAT-past

(22)a. He bought the books.
b. Hon o kaimashita.
   BOOK-Obj BUY-past

Suppressing such information during translation is less difficult than arranging for its addition when translating in the opposite direction. When translating from Japanese to English, for instance, how is a program to determine whether an entity is definite, or plural, or third-person? Of course, such problems are not unique to SLT - they are equally present in text translation. In spoken translation, though, there is the added difficulty of resolving them in real time.
The first observation we can make about mismatch resolution is that it is in some respects akin to ambiguity resolution. In both cases, information is missing which must be supplied somehow: in translation mismatches, missing information must be filled in; in ambiguity resolution, missing information must guide a choice. In light of this similarity, the interactive resolution techniques suggested above for ambiguity resolution can be suggested for mismatches as well. For example, it would be relatively straightforward to put up a menu offering a choice between singular and plural - or "one vs. many", etc. Granted, other sorts of information, for example concerning definiteness, would be trickier to elicit in non-technical terms.24 Of course, handling many such requests would be tedious, so interface design would be crucial. And again, the hope is that the need for interaction will shrink as knowledge-source integration advances.
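For concreteness, a minimal sketch of such an elicitation for the number mismatch of (21)-(22) follows; it is illustrative only, and no system described here implements it.

(defun elicit-number (noun)
  "Ask the user whether NOUN denotes one entity or many; a stand-in
for the menu-based interaction suggested above."
  (format *query-io* "~&For \"~a\": one or many? [1/m] " noun)
  (finish-output *query-io*)
  (if (string-equal (string-trim " " (read-line *query-io*)) "1")
      :singular
      :plural))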
A second observation about mismatch resolution is that, when missing information cannot be accurately computed and is excessively burdensome for users to supply, it can simply be left missing. For example, for translating (21b) when the correct English would be (21a), the incomplete translation (23) could be produced. This broken English would at least allow the hearer to infer the correct meaning from context, more or less as a hearer of the original Japanese (or a hearer of "real" broken English) would have to do. Further, supplying insufficient information is usually better than supplying incorrect information: in the same situation, (23) would be far less confusing than (24), say, which is wrong on several counts. Thus far, however, I am aware of no SLT programs which deliberately abstain when in doubt.

(23) *bought book

(24) I boughta book.

Ideally, however, translation software will do its best to resolve mismatches before requesting help from the user or throwing in the towel. Researchers in this area have tended to create programs focusing on a specific sort of mismatch. For example, Murata and Nagao (1993) proposed an expert system for supplying number and definiteness information, and thus articles, during Japanese-English translation.
In a similar spirit, Seligman (1994b) describes a program for resolving the references of zero pronouns in the Asura SLT system (Morimoto et al., 1993), thus supplying the missing pronouns for translation. The program, based upon the theory of centering (Sidner, 1979; Grosz et al., 1983; Joshi & Weinstein, 1981; Takeda & Doi, 1994), follows unpublished work by Masaaki Nagata. It is invoked from within specially modified transfer rules for verbs, and can work alongside other pronoun-resolution techniques, for example those making use of Japanese honorific information (Dohsaka, 1990). No evaluations have yet been made.
Other mismatch problems to be addressed are surveyed in Seligman et al. (1993) in the context of Japanese-English or Japanese-German transfer. These include the determination of tense (Japanese, for example, does not have an explicit future tense); aspect (Japanese lacks explicit cues which would license a choice between (25a, b)); intimacy (as required for a choice between German du and Sie - Japanese does supply a great deal of information concerning politeness, formality, relative status, etc., but none of these map cleanly into the German distinction); choice of possessive determiners (Japanese often uses only namae, 'name', where English would have your name); and several other sorts of mismatch.

(25)a. He has been studying.

b. He is studying.

11. Conclusions

The first section of the paper described a "low road" or "quick and dirty" approach to SLT, in which interactive disambiguation of SR and translation is temporarily substituted for system integration. This approach, I believe, is likely to yield broad-coverage systems with usable quality sooner than approaches which aim for maximally automatic operation based upon tight integration of knowledge sources and components.
Two demonstrations of "quick and dirty" SLT over the Internet were reported. For the demos, an experimental chat translation system created by CompuServe, Inc. was provided with front and back ends, using commercial dictation products for speech input and commercial speech-synthesis engines for speech output. The dictation products' standard interfaces were used to debug dictation results interactively. While evaluation of these experiments remained informal, coverage was much broader than in most SLT experiments to date - in the tens of thousands of words. While interactive control of translation was lacking, output quality was probably sufficient for many social exchanges.
But while the "low road"may offer the fastest route to usable broad-coverage
SLT systems, automaticoperationbased upon knowledge-source integrationis
certainto remaindesirablein the longer run. Hence the balance of the paperhas
concentratedon aspectsof integratedsystems.
Takentogether,the nine areasof researchexaminedin the papersuggest a nine-
item wish list for an experimentalSLTsystem.
1. The system would include facilities for interactive disambiguation of both speech and translation candidates.
2. Its architecture would allow modular reconfiguration and global coordination of components.
3. It would employ a perspicuous set of data structures for tracking information from multiple processes: stages of translation, multiple tracks, and height, span, or dominance of nodes would be clearly distinguished.
4. The system would employ a grammar whose terminals were phones, recognizing both words and syntactic structures in a uniform and integrated manner, e.g., via island-driven chart parsing.
5. Natural pauses and other aspects of prosody would be used to segment utterances and otherwise aid analysis.
6. Similarity-based techniques for resolving ambiguities, comparable to those of EBMT, would be effectively used. Stages of translation yielding potential ambiguities would be kept distinct; similarity would be measured along several dimensions (e.g., syntactic and phonological in addition to semantic); top-down as well as bottom-up constraints would be exercised; and disambiguation using both probability-based and similarity-based techniques would be used in complementary fashion.
7. Speech or dialogue acts would be defined in terms of their cue patterns, and analyses based upon them would be exploited for SR and analysis.
8. Semantically smoothed tracking of lexical co-occurrences would provide a network of associations useful for SR, lexical disambiguation, and topic-boundary recognition.
9. A suite of specialized programs would help to resolve translation mismatches, for instance to supply referents for zero pronouns.

Acknowledgements
Warmest appreciation to CompuServe, Inc. for making the chat-based SLT demonstrations possible. In particular, thanks are due to Mary Flanagan, then Manager, Advanced Technologies, and to Sophie Toole, then Supervisor, Language Support. Ms. Flanagan authorized and oversaw both demos. Ms. Toole organized and conducted the Grenoble demo and played an active role in making the SR and speech-synthesis software operational. Thanks also to Phil Jensen and Doug Chinnock, translation system engineers. The demos made use of pre-existing proprietary software.

Work on all nine of the issues discussed here began at ATR Interpreting Telecommunications Laboratories in Kyoto, Japan. I am very grateful for the support and stimulation I received there.

Thanks also to numerous colleagues at GETA (Groupe d'Etudes pour la Traduction Automatique) at the Université Joseph Fourier in Grenoble, France; and at DFKI (Deutsches Forschungszentrum für Künstliche Intelligenz) in Saarbrücken, Germany.

The opinions expressed throughout are mine alone.

Notes
1. CompuServe's chat translation project was discontinued in early 1998. All trademarks are hereby acknowledged.
2. A later commercial chat translation service, that of Uni-verse, Inc. (now discontinued), gave a comparable throughput in 2-3 seconds.
3. Continuous French was released just before the second demo, but because little testing time was available, a decision was made to forego its use.
4. SpeechLinks software from SpeechOne, Inc.
5. By March 1998, upgrades of the continuous software had already made this macro less necessary. Direct dictation to the chat window would then have been possible without it, with some sacrifice of advanced features for voice-driven interactive correction of errors.
6. Kowalski et al. (1995) arranged the only previous demonstration known to the author of SLT using commercial dictation software for input (though at least one group (Miike et al., 1988) had previously simulated SLT after a fashion by automatically translating typed conversations). Since Kowalski's users (spectators at twin exposition displays in Boston, Massachusetts and Lyons, France) were untrained, little interactive correction of dictation was possible. For this and other reasons, translation quality was generally low (Burton Rosenberg, personal communication); but as the main purpose of the demo was to make an artistic and social statement concerning future hi-tech possibilities for cross-cultural communication, this was no great cause for concern. Text was transmitted via FTP, rather than via chat as in the experiments reported here. See Seligman (1997) for a fuller account.
7. www.itl.atr.co.jp/matrix/c-star/matrix.en.html
8. www.c-star.org
9. Mailbox files were extensively and successfully used in the French entry in the C-STAR II SLT demo of July 22, 1999 (www.c-star.org).
10. Inclusion of other levels is also possible. At the lower limit, assuming the grammar were stochastic, one could even use sub-phone speech segments as grammar terminals, thus subsuming even HMM-based phone recognition in the parsing regime. At an intermediate level between phones and words, syllables could be used.
11. The parse tree was not used for analysis, however. Instead, it was discarded, and a unification-based parser began a new parse for MT purposes on a text string passed from speech recognition.
12. For one example of extensive related work in the framework of the Verbmobil system, see Kompe et al. (1997).
13. A related but distinct proposal appears in Hosaka et al. (1994).
14. Entropic Research Laboratory, Washington, DC, 1993.
15. PanEBMT operates solo only when the entire source expression can be rendered with a single memorized target expression.
16. The sort of generalization suggested here - from graded semantic similarity measurements to graded measurements of similarity along multiple dimensions - should not be confused with that of Generalized EBMT, the example-based technique proposed for CMU's PanEBMT engine. That engine utilizes no graded similarity measurements along any scale. Its generalization instead involves substitution of semantic tags for lexical items in examples and in input, so that for example John Hancock was in Washington becomes (PERSON) was in (CITY).
17. In this calculation, fixed elements are treated differently from variable elements, and variable elements can be weighted to varying degrees: the heads of complex structures are differently weighted than non-heads.
18. See for example the website of the Discourse Resource Initiative, www.georgetown.edu/luper-foy/Discourse-Treebank/dri-home.html, with links to recent workshops, or browse Walker (1999), especially regarding attempted standardization of Japanese discourse labeling (Ichikawa et al., 1999).
19. Called "Communicative Acts" in Seligman et al. (1995) and "Situational Formulas" in Seligman (1991).
20. While speakers often repeat an interlocutor's utterance to confirm it, we do not use a REPEAT-TO-CONFIRM CBSA, since it is apparently signaled by no cue patterns, and thus could only be recognized by noting inter-utterance repetition.
21. For example, if the concept MONEY has been observed, then the lexical item bank has the meaning closest to MONEY in the network: 'savings institution' rather than 'edge of river', etc.
22. See Schütze (1998) or Veling and van der Weerd (1999) concerning the use of co-occurrence networks for disambiguation, though without comparable segmentation or semantic smoothing.
23. Compare, for example, Morris and Hirst (1991), Hearst (1994), Nomoto and Nitta (1994), or Kozima and Furugori (1994).
24. One possible formulation: "Can the audience easily identify which one is meant?" See Boitet (1996a) for discussion.

References
Aberdeen, John, Sam Bayer, Sasha Caskey, Laurie Damianos, Alan Goldschen, Lynette Hirschman, Dan Loehr and Hugo Trappe: 1999, 'Implementing Practical Dialogue Systems with the DARPA Communicator Architecture', IJCAI-99 Workshop on Knowledge and Reasoning in Practical Dialogue Systems, Stockholm, Sweden, pp. 81-86.
Alexandersson, Jan, Norbert Reithinger and Elisabeth Maier: 1997, 'Insights into the Dialogue Processing of Verbmobil', Fifth Conference on Applied Natural Language Processing, Washington, DC, pp. 33-40.
Barnett, Jim, Kevin Knight, Inderjeet Mani and Elaine Rich: 1990, 'Knowledge and Natural Language Processing', Communications of the ACM 33(8), 50-71.
Black, Alan: 1997, 'Predicting the Intonation of Discourse Segments from Examples in Dialogue Speech', in Y. Sagisaka, N. Campbell and N. Higuchi (eds), Computing Prosody, Springer Verlag, Berlin, pp. 117-128.
Black, Ezra, Roger Garside and Geoffrey Leech: 1993, Statistically-driven Computer Grammars of English: The IBM/Lancaster Approach, Rodopi, Amsterdam.
Blanchon, Hervé: 1996, 'A Customizable Interactive Disambiguation Methodology and Two Implementations to Disambiguate French and English Input', in C. Boitet (1996a), pp. 190-200.
Boitet, Christian (ed.): 1996a, Proceedings of MIDDIM-96 Post-COLING Seminar on Interactive Disambiguation, Le Col de Porte, France.
Boitet, Christian: 1996b, 'Dialogue-based Machine Translation for Monolinguals and Future Self-explaining Documents', in C. Boitet (1996a), pp. 75-85.
Boitet, Christian and Mark Seligman: 1994, 'The "Whiteboard" Architecture: A Way to Integrate Heterogeneous Components of NLP Systems', COLING 94, The 15th International Conference on Computational Linguistics, Kyoto, Japan, pp. 426-430.
Brown, Ralph D.: 1996, 'Example-based Machine Translation in the Pangloss System', COLING-96, The 16th International Conference on Computational Linguistics, Copenhagen, Denmark, pp. 169-174.
Dohsaka, K.: 1990, 'Identifying the Referents of Zero-pronouns in Japanese Based on Pragmatic Constraint Interpretation', 9th European Conference on Artificial Intelligence, ECAI '90, Stockholm, Sweden, pp. 240-245.
Erman, Lee D. and Victor R. Lesser: 1990, 'The Hearsay-II Speech Understanding System: A Tutorial', in A. Waibel and K.-F. Lee (eds), Readings in Speech Recognition, Morgan Kaufmann, San Mateo, CA, pp. 235-245.
Fano, Robert M.: 1961, Transmission of Information: A Statistical Theory of Communications, MIT Press, Cambridge, MA.
Fillmore, Charles J., Paul Kay and Catherine O'Connor: 1988, 'Regularity and Idiomaticity in Grammatical Constructions: The Case of Let Alone', Language 64, 501-538.
Flanagan, Mary: 1997, 'Machine Translation of Interactive Texts', in MT Summit VI, Machine Translation: Past Present Future, San Diego, CA, p. 50.
Frederking, Robert, Alexander Rudnicky and Christopher Hogan: 1997, 'Interactive Speech Translation in the DIPLOMAT Project', Spoken Language Translation: Proceedings of a Workshop Sponsored by the Association for Computational Linguistics and by the European Network in Language and Speech (ELSNET), Madrid, Spain, pp. 61-66.
Furukawa Ryo, Yato Fumihiro and Loken-Kim Kyung-ho: 1993, Denwakaiwa o Maruchimedia Kaiwa no Tokuchobunseki [Multimedia Dialogue Feature Analysis of Telephone Conversations]. Technical Report TR-IT-0020, ATR Interpreting Telecommunications Laboratories, Kyoto, Japan.
Furuse, Osamu and Hitoshi Iida: 1996, 'Incremental Translation Using Constituent Boundary Patterns', COLING-96, The 16th International Conference on Computational Linguistics, Copenhagen, Denmark, pp. 412-417.
Görz, Günther, Marcus Kesseler, Jörg Spilker and Hans Weber: 1996, 'Research on Architectures for Integrated Speech/Language Systems in Verbmobil', COLING-96, The 16th International Conference on Computational Linguistics, Copenhagen, Denmark, pp. 484-489.
Grosz, Barbara J., Aravind K. Joshi and Scott Weinstein: 1983, 'Providing a Unified Account of Definite Noun Phrases in Discourse', 21st Annual Meeting of the Association for Computational Linguistics, Cambridge, MA, pp. 44-50.
Hearst, Marti A.: 1994, 'Multi-paragraph Segmentation of Expository Text', 32nd Annual Meeting of the Association for Computational Linguistics, Las Cruces, NM, pp. 9-16.
Hosaka, Junko, Mark Seligman and Harald Singer: 1994, 'Pause as a Phrase Demarcator for Speech and Language Processing', COLING 94, The 15th International Conference on Computational Linguistics, Kyoto, Japan, pp. 987-991.
Ichikawa, A., M. Araki, Y. Horiuchi, M. Ishizaki, S. Itabashi, T. Itoh, H. Kashioka, K. Kato, H. Kikuchi, H. Koiso, T. Kumagai, A. Kurematsu, K. Maekawa, S. Nakazato, M. Tamoto, S. Tutiya, Y. Yamashita and T. Yoshimura: 1999, 'Evaluation of Annotation Schemes for Japanese Discourse', in M. Walker (1999), pp. 26-34.
Iida, Hitoshi, Eichiro Sumita and Osamu Furuse: 1996, 'Spoken Language Translation Method Using Examples', COLING-96, The 16th International Conference on Computational Linguistics, Copenhagen, Denmark, pp. 1074-1077.
Iwadera, T., M. Ishizaki and T. Morimoto: 1995, 'Recognizing an Interactional Structure and Topics of Task-oriented Dialogues', Proceedings of the European Workshop on Spoken Dialogue Systems, Vigsø, Denmark, pp. 41-44.
Jokinen, Kristiina, Hideki Tanaka and Akio Yokoo: 1998, 'Context Management with Topics for Spoken Dialogue Systems', COLING-ACL '98: 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, Montreal, Canada, pp. 631-637.
Joshi, Aravind K. and Scott Weinstein: 1981, 'Control of Inference: Role of some Aspects of Discourse Structure - Centering', Seventh International Joint Conference on Artificial Intelligence (IJCAI-81), Vancouver, BC, pp. 385-387.
Julia, L., L. Neumeyer, M. Charafeddine, A. Cheyer and J. Dowding: 1997, 'HTTP://WWW.SPEECH.SRI.COM/DEMOS/ATIS.HTML', Working Notes of the AAAI-97 Spring Symposium Workshop on Natural Language Processing for the Web, Stanford, CA, pp. 72-76.
Jurafsky, Daniel: 1993, A Cognitive Model of Sentence Interpretation: The Construction Grammar Approach. Technical Report TR-93-077, International Computer Science Institute and Department of Linguistics, University of California, Berkeley.
Kay, Paul: 1990, 'Even', Linguistics and Philosophy 13, 59-216.
Knott, Alistair: 1996, A Data-driven Methodology for Motivating a Set of Coherence Relations, Ph.D. thesis, Department of Artificial Intelligence, University of Edinburgh.
Knott, Alistair and Robert Dale: 1995, 'Using Linguistic Phenomena to Motivate a Set of Rhetorical Relations', Discourse Processes 18, 35-62.
Kompe, R., A. Kiessling, H. Niemann, E. Noeth, A. Batliner, S. Schachtl, R. Ruland and H. U. Block: 1997, 'Improving Parsing of Spontaneous Speech with the Help of Prosodic Boundaries', 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '97), Munich, Germany, pp. 811-814.
Kowalski, Piotr, Burton Rosenberg and Jeffrey Krause: 1995, 'Information Transcript', Biennale de Lyon d'Art Contemporain, Lyon, France.
Kozima, Hideki and Teiji Furugori: 1994, 'Segmenting Narrative Text into Coherent Scenes', Literary and Linguistic Computing 9, 13-19.
Lenat, Douglas B. and R. V. Guha: 1990, Building Large Knowledge-based Systems, Addison-Wesley, Reading, MA.
Loken-Kim, Kyung-ho, Fumihiro Yato, Kazuhiko Kurihara, Laurel Fais and Ryo Furukawa: 1993, AMUSE - ATR Multi-media Simulation Environment. Technical Report TR-IT-0018, ATR Interpreting Telecommunications Laboratories, Kyoto, Japan.
Mahesh, Kavi (ed.): 1997, Natural Language Processing for the World Wide Web: Papers from the 1997 AAAI Spring Symposium, The AAAI Press, Cambridge, MA.
Miike, Seiji, Koichi Hasebe, Harold Somers and Shin-ya Amano: 1988, 'Experiences with an On-line Translating Dialogue System', The 26th Annual Meeting of the Association for Computational Linguistics, Buffalo, NY, pp. 155-162.
Morimoto, T., T. Takezawa, F. Yato, S. Sagayama, T. Tashiro, M. Nagata and A. Kurematsu: 1993, 'ATR's Speech Translation System: ASURA', European Conference on Speech Communication and Technology, Berlin, Germany, pp. 1295-1298.
Morris, Jane and Graeme Hirst: 1991, 'Lexical Cohesion Computed by Thesaural Relations as an Indicator of the Structure of Text', Computational Linguistics 17, 21-48.
Murata, Masaki and Makoto Nagao: 1993, 'Determination of Referential Property and Number of Nouns in Japanese Sentences for Machine Translation into English', Proceedings of the Fifth International Conference on Theoretical and Methodological Issues in Machine Translation TMI '93 - MT in the Next Generation, Kyoto, Japan, pp. 218-225.
Nadas, Arthur: 1985, 'On Turing's Formula for Word Probabilities', IEEE Transactions on Acoustics, Speech and Signal Processing 33, 1414-1416.
Nagao, Makoto: 1984, 'A Framework of a Mechanical Translation between Japanese and English by Analogy Principle', in A. Elithorn and R. Banerji (eds), Artificial and Human Intelligence, North-Holland, Amsterdam, pp. 173-180.
Nagata, Masaaki and Tsuyoshi Morimoto: 1993, 'An Experimental Statistical Dialogue Model to Predict the Speech Act Type of the Next Utterance', Proceedings of ISSD-93, International Symposium on Spoken Dialogue - New Directions in Human and Man-machine Communication, Tokyo, Japan, pp. 83-86.
Nomoto, Tadashi and Yoshihiko Nitta: 1994, 'A Grammatico-statistical Approach to Discourse Partitioning', COLING 94, The 15th International Conference on Computational Linguistics, Kyoto, Japan, pp. 1145-1150.
Ohno Susumu and Hamanishi Masando: 1981, Kadokawa Ruigo Shin-jiten [Kadokawa New Word Category Dictionary], Kadokawa Shoten, Tokyo.
Pyra, Marianne: 1995, Using Internet Relay Chat, Que Corporation, Indianapolis, IN.
Reithinger, Norbert: 1995, 'Some Experiments in Speech Act Prediction', in Johanna Moore and Marilyn Walker (eds), Empirical Methods in Discourse: Interpretation & Generation: Papers from the 1995 AAAI Symposium, AAAI Press, Cambridge, MA, pp. 126-131.
Reithinger, Norbert and Martin Klesen: 1997, 'Dialogue Act Classification Using Language Models', Proceedings of the 5th European Conference on Speech Communication and Technology (Eurospeech), Rhodes, Greece, pp. 2235-2238.
Sato, Satoshi: 1991, Example-based Machine Translation, Doctoral thesis (in Japanese), Kyoto University, Japan.
Schütze, Hinrich: 1998, 'Automatic Word Sense Discrimination', Computational Linguistics 24, 97-124.
Searle, J.: 1969, Speech Acts, Cambridge University Press, Cambridge, England.
Seligman, Mark: 1991, Generating Discourses from Networks Using an Inheritance-based Grammar, Dissertation, Department of Linguistics, University of California, Berkeley.
Seligman, Mark: 1994a, CO-OC: Semi-automatic Production of Resources for Tracking Morphological and Semantic Co-occurrences in Spontaneous Dialogues. Technical Report TR-IT-0084, ATR Interpreting Telecommunications Laboratories, Kyoto, Japan.
Seligman, Mark: 1994b, CNTR: Basic Functions for Centering Experiments with ASURA. Technical Report TR-IT-0085, ATR Interpreting Telecommunications Laboratories, Kyoto, Japan.
Seligman, Mark: 1997, 'Interactive Real-time Translation via the Internet', in K. Mahesh (1997), pp. 142-148.
Seligman, Mark, Jan Alexandersson and Kristiina Jokinen: 1999, 'Tracking Morphological and Semantic Co-occurrences in Spontaneous Dialogues', IJCAI-99 Workshop on Knowledge and Reasoning in Practical Dialogue Systems, Stockholm, Sweden, pp. 105-111.
Seligman, Mark and Christian Boitet: 1993, 'A "Whiteboard" Architecture for Automatic Speech Translation', Proceedings of ISSD-93, International Symposium on Spoken Dialogue - New Directions in Human and Man-machine Communication, Tokyo, Japan, pp. 243-246.
Seligman, Mark, Christian Boitet and Boubaker Meddeb-Hamrouni: 1998a, 'Transforming Lattices into Non-deterministic Automata with Optional Null Arcs', COLING-ACL '98: 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, Montreal, Canada, pp. 1205-1211.
Seligman, Mark, Laurel Fais and Mutsuko Tomokiyo: 1995, A Bilingual Set of Communicative Act Labels for Spontaneous Dialogues. Technical Report TR-IT-0081, ATR Interpreting Telecommunications Laboratories, Kyoto, Japan.
Seligman, Mark, Mary Flanagan and Sophie Toole: 1998b, 'Dictated Input for Broad-coverage Speech Translation', Association for Machine Translation in the Americas (AMTA-98), Workshop on Embedded MT Systems: Design, Construction, and Evaluation of Systems with an MT Component, Langhorne, PA.
Seligman, Mark, Junko Hosaka and Harald Singer: 1997, '"Pause Units" and Analysis of Spontaneous Japanese Dialogues: Preliminary Studies', in E. Maier, M. Mast and S. Luperfoy (eds), Dialogue Processing in Spoken Language Systems, Springer, Berlin, pp. 110-112.
Seligman, Mark, Masami Suzuki and Tsuyoshi Morimoto: 1993, Semantic-level Transfer in Japanese-German Speech Translation: Some Experiences. Technical Report NLC 93-13, Institute of Electronics, Information, and Communication Engineers (IEICE), Japan.
Sidner, Candace: 1979, Toward a Computational Theory of Definite Anaphora Comprehension in English. Technical Report AI-TR-537, MIT, Cambridge, MA.
Sobashima, Yasuhiro and Hitoshi Iida: 1995, 'A Multi-dimensional Analogy-based, Context-dependent, Bottom-up Parsing Method for Spoken Dialogues', Third Natural Language Processing Pacific Rim Symposium NLPRS '95, Seoul, Korea, pp. 586-591.
Sobashima, Yasuhiro and Mark Seligman: 1994, 'Yorei to no tagenteki ruijido keisan ni motodzuku bunmyakuizon no kobunkaiseki ho' [Parsing Method for Example-based Analysis Integrating Multiple Knowledge Sources], Shadan hojin joho shori gakkai dai 49 kai zenkoku taikai koen ronbunshu, Vol. 3, Sapporo, Japan, pp. 103-104.
Stock, Oliviero, Rino Falcone and Patrizia Insinnamo: 1989, 'Bi-directional Charts: A Potential Technique for Parsing Spoken Natural Language Sentences', Computer Speech and Language 3, 219-237.
Sumita, Eichiro and Hitoshi Iida: 1992, 'Example-based Transfer of Adnominal Particles into English', IEICE Transactions on Information Systems E75-D(4), 585-594.
Takeda, Shingo and Norihisa Doi: 1994, 'Centering in Japanese: A Step Towards Better Interpretation of Pronouns and Zero-pronouns', COLING 94, The 15th International Conference on Computational Linguistics, Kyoto, Japan, pp. 1151-1156.
Takezawa, Toshiyuki, Fumiaki Sugaya and Akio Yokoo: 1999, 'ATR-MATRIX: A Spontaneous Speech Translation System between English and Japanese', ATR Journal 2, 29-33.
Veling, Anne and Peter van der Weerd: 1999, 'Conceptual Grouping in Word Co-occurrence Networks', IJCAI-99 Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence, Stockholm, Sweden, pp. 694-699.
Wahlster, W.: 1993, Verbmobil: Translation of Face-to-Face Dialogs. Research Report RR-93-34, German Research Center for Artificial Intelligence (DFKI GmbH), Saarbrücken, Germany.
Walker, Marilyn (ed.): 1999, Proceedings of the ACL '99 Workshop: Towards Standards and Tools for Discourse Tagging, College Park, MD.
Zajac, Remi and Mark Casper: 1997, 'The Temple Web Translator', in K. Mahesh (1997), pp. 149-154.
