
PREDICTING INTONATIONAL PHRASING FROM TEXT

Michelle Q. Wang
Churchill College, Cambridge University, Cambridge, UK

Julia Hirschberg
AT&T Bell Laboratories, 600 Mountain Avenue, Murray Hill, NJ 07974

Abstract

Determining the relationship between the intonational characteristics of an utterance and other features inferable from its text is important both for speech recognition and for speech synthesis. This work investigates the use of text analysis in predicting the location of intonational phrase boundaries in natural speech, through analyzing 298 utterances from the DARPA Air Travel Information Service database. For statistical modeling, we employ Classification and Regression Tree (CART) techniques. We achieve success rates of just over 90%, representing a major improvement over other attempts at boundary prediction from unrestricted text.[1]

[1] We thank Michael Riley for helpful discussions. Code implementing the CART techniques employed here was written by Michael Riley and Daryl Pregibon. Part-of-speech tagging employed Ken Church's tagger, and syntactic analysis used Don Hindle's parser, Fidditch.

Introduction

The relationship between the intonational phrasing of an utterance and other features which can be inferred from its transcription represents an important source of information for speech synthesis and speech recognition. In synthesis, more natural intonational phrasing can be assigned if text analysis can predict human phrasing performance. In recognition, better calculation of probable word durations is possible if the phrase-final lengthening that precedes boundary sites can be predicted. Furthermore, the association of intonational features with syntactic and acoustic information can also be used to reduce the number of sentence hypotheses under consideration.

Previous research on the location of intonational boundaries has largely focussed on the relationship between these prosodic boundaries and syntactic constituent boundaries. While current research acknowledges the role that semantic and discourse-level information play in boundary assignment, most authors assume that syntactic configuration provides the basis for prosodic 'defaults' that may be overridden by semantic or discourse considerations. While most interest in boundary prediction has been focussed on synthesis (Gee and Grosjean, 1983; Bachenko and Fitzpatrick, 1990), currently there is considerable interest in predicting boundaries to aid recognition (Ostendorf et al., 1990; Steedman, 1990). The most successful empirical studies of boundary location have investigated how phrasing can disambiguate potentially syntactically ambiguous utterances in read speech (Lehiste, 1973; Ostendorf et al., 1990). Analyses based on corpora of natural speech (Altenberg, 1987) have so far reported very limited success and have assumed the availability of syntactic, semantic, and discourse-level information well beyond the capabilities of current NL systems to provide.

To address the question of how boundaries are assigned in natural speech -- as well as the need for classifying boundaries from information that can be extracted automatically from text -- we examined a multi-speaker corpus of spontaneous elicited speech. We wanted to compare performance in the prediction of intonational boundaries from information available through simple techniques of text analysis with performance using information currently available only from hand labeling of transcriptions. To this end, we selected potential boundary predictors based upon hypotheses derived from our own observations and from previous theoretical and practical studies of boundary location. Our corpus for this investigation is 298 sentences from approximately 770 sentences of the Texas Instruments-collected portion of the DARPA Air Travel Information Service (ATIS) database (DARPA, 1990). For statistical modeling, we employ classification and regression tree (CART) techniques (Brieman et al., 1984), which provide cross-validated decision trees for boundary classification. We obtain (cross-validated) success rates of 90% for both automatically-generated information and hand-labeled data on this sample, which represents a major improvement over previous attempts to predict intonational boundaries for spontaneous speech and equals or betters previous (hand-crafted) algorithms tested for read speech.

Intonational Phrasing

Intuitively, intonational phrasing divides an utterance into meaningful 'chunks' of information (Bolinger, 1989). Variation in phrasing can change the meaning hearers assign to tokens of a given sentence. For example, the interpretation of a sentence like 'Bill doesn't drink because he's unhappy.' will change, depending upon whether it is uttered as one phrase or two. Uttered as a single phrase, this sentence is commonly interpreted as conveying that Bill does indeed drink -- but the cause of his drinking is not his unhappiness. Uttered as two phrases, it is more likely to convey that Bill does not drink -- and the reason for his abstinence is his unhappiness.

To characterize this phenomenon phonologically, we adopt Pierrehumbert's theory of intonational description for English (Pierrehumbert, 1980). In this view, two levels of phrasing are significant in English intonational structure. Both types are composed of sequences of high and low tones in the FUNDAMENTAL FREQUENCY (f0) contour. An INTERMEDIATE (or minor) PHRASE consists of one or more PITCH ACCENTS (local f0 minima or maxima) plus a PHRASE ACCENT (a simple high or low tone which controls the pitch from the last pitch accent of one intermediate phrase to the beginning of the next intermediate phrase or the end of the utterance). INTONATIONAL (or major) PHRASES consist of one or more intermediate phrases plus a final BOUNDARY TONE, which may also be high or low, and which occurs at the end of the phrase. Thus, an intonational phrase boundary necessarily coincides with an intermediate phrase boundary, but not vice versa.

While phrase boundaries are perceptual categories, they are generally associated with certain physical characteristics of the speech signal. In addition to the tonal features described above, phrases may be identified by one or more of the following features: pauses (which may be filled or not), changes in amplitude, and lengthening of the final syllable in the phrase (sometimes accompanied by glottalization of that syllable and perhaps preceding syllables). In general, major phrase boundaries tend to be associated with longer pauses, greater tonal changes, and more final lengthening than minor boundaries.

The Experiments

The Corpus and Features Used in Analysis

The corpus used in this analysis consists of 298 utterances (24 minutes of speech from 26 speakers) from the speech data collected by Texas Instruments for the DARPA Air Travel Information System (ATIS) spoken language system evaluation task. In a Wizard-of-Oz simulation, subjects were asked to make travel plans for an assigned task, providing spoken input and receiving teletype output. The quality of the ATIS corpus is extremely diverse. Speaker performance ranges from close to isolated-word speech to exceptional fluency. Many utterances contain hesitations and other disfluencies, as well as long pauses (greater than 3 sec. in some cases).

To prepare this data for analysis, we labeled the speech prosodically by hand, noting location and type of intonational boundaries and presence or absence of pitch accents. Labeling was done from both the waveform and pitchtracks of each utterance. Each label file was checked by several labelers. Two levels of boundary were labeled; in the analysis presented below, however, these are collapsed to a single category.

We define our data points to consist of all potential boundary locations in an utterance, defined as each pair of adjacent words in the utterance <wi, wj>, where wi represents the word to the left of the potential boundary site and wj represents the word to the right.[2] Given the variability in performance we observed among speakers, an obvious variable to include in our analysis is speaker identity. While for applications to speaker-independent recognition this variable would be uninstantiable, we nonetheless need to determine how important speaker idiosyncrasy may be in boundary location. We found no significant increase in predictive power when this variable is used. Thus, results presented below are speaker-independent.

[2] See the appendix for a partial list of variables employed, which provides a key to the node labels for the prediction trees presented in Figures 1 and 2.

One easily obtainable class of variable involves temporal information. Temporal variables include utterance and phrase duration, and distance of the
potential boundary from various strategic points in the utterance. Although it is tempting to assume that phrase boundaries represent a purely intonational phenomenon, it is possible that processing constraints help govern their occurrence. That is, longer utterances may tend to include more boundaries. Accordingly, we measure the length of each utterance both in seconds and in words. The distance of the boundary site from the beginning and end of the utterance is another variable which appears likely to be correlated with boundary location. The tendency to end a phrase may also be affected by the position of the potential boundary site in the utterance. For example, it seems likely that positions very close to the beginning or end of an utterance would be unlikely positions for intonational boundaries. We measure this variable too, both in seconds and in words.

The importance of phrase length has also been proposed (Gee and Grosjean, 1983; Bachenko and Fitzpatrick, 1990) as a determiner of boundary location. Simply put, it may be that consecutive phrases have roughly equal length. To capture this, we calculate the elapsed distance from the last boundary to the potential boundary site, divided by the length of the last phrase encountered, both in time and in words. To obtain this information automatically would require us to factor prior boundary predictions into subsequent predictions. While this would be feasible, it is not straightforward in our current classification strategy. So, to test the utility of this information, we have used observed boundary locations in our current analysis.

As noted above, syntactic constituency information is generally considered a good predictor of phrasing information (Gee and Grosjean, 1983; Selkirk, 1984; Marcus and Hindle, 1985; Steedman, 1990). Intuitively, we want to test the notion that some constituents may be more or less likely than others to be internally separated by intonational boundaries, and that some syntactic constituent boundaries may be more or less likely to coincide with intonational boundaries. To test the former, we examine the class of the lowest node in the parse tree to dominate both wi and wj, using Hindle's parser, Fidditch (1989). To test the latter, we determine the class of the highest node in the parse tree to dominate wi but not wj, and the class of the highest node in the tree to dominate wj but not wi.

Word class has also been used often to predict boundary location, particularly in text-to-speech. The belief that phrase boundaries rarely occur after function words forms the basis for most algorithms used to assign intonational phrasing for text-to-speech. Furthermore, we might expect that some words, such as prepositions and determiners, do not constitute the typical end to an intonational phrase. We test these possibilities by examining part-of-speech in a window of four words surrounding each potential phrase break, using Church's part-of-speech tagger (1988).

Recall that each intermediate phrase is composed of one or more pitch accents plus a phrase accent, and each intonational phrase is composed of one or more intermediate phrases plus a boundary tone. Informal observation suggests that phrase boundaries are more likely to occur in some accent contexts than in others. For example, phrase boundaries between words that are deaccented seem to occur much less frequently than boundaries between two accented words. To test this, we look at the pitch accent values of wi and wj for each <wi, wj>, comparing observed values with predicted pitch accent information obtained from (Hirschberg, 1990).

In the analyses described below, we employ varying combinations of these variables to predict intonational boundaries. We use classification and regression tree techniques to generate decision trees automatically from the variable values provided.

Classification and Regression Tree Techniques

Classification and regression tree (CART) analysis (Brieman et al., 1984) generates decision trees from sets of continuous and discrete variables by using a set of splitting rules, stopping rules, and prediction rules. These rules affect the internal nodes, subtree height, and terminal nodes, respectively. At each internal node, CART determines which factor should govern the forking of two paths from that node. Furthermore, CART must decide which values of the factor to associate with each path. Ideally, the splitting rules should choose the factor and value split which minimizes the prediction error rate. The splitting rules in the implementation employed for this study (Riley, 1989) approximate optimality by choosing at each node the split which minimizes the prediction error rate on the training data. In this implementation, all these decisions are binary, based upon consideration of each possible binary partition of values of categorical variables and consideration of different cut-points for values of continuous variables.
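The split search just described is simple to sketch. The code below is our illustration, not the Riley (1989) implementation: for a single continuous variable, it enumerates candidate cut-points (midpoints between adjacent sorted values) and keeps the binary split that minimizes misclassification on the training data, labeling each side with its majority class. The function name and toy data are assumptions.

```python
def best_continuous_split(values, labels):
    """Return (cut_point, error): the binary split `value <= cut` vs
    `value > cut` minimizing training misclassification."""
    pairs = sorted(zip(values, labels))
    best = (None, len(labels) + 1)
    # candidate cut-points: midpoints between adjacent sorted values
    candidates = {(pairs[k][0] + pairs[k + 1][0]) / 2.0
                  for k in range(len(pairs) - 1)}
    for cut in candidates:
        left = [lab for v, lab in pairs if v <= cut]
        right = [lab for v, lab in pairs if v > cut]
        # each side is predicted as its majority class; the error is the
        # number of observations that disagree with that prediction
        err = sum(len(side) - max(side.count(c) for c in set(side))
                  for side in (left, right) if side)
        if err < best[1]:
            best = (cut, err)
    return best

# Toy example: boundary presence against seconds elapsed in the utterance
cut, err = best_continuous_split(
    [0.4, 1.1, 2.0, 2.8, 3.5, 4.2],
    ['no', 'no', 'no', 'yes', 'yes', 'yes'])
```

A full CART node would run this search over every available variable (and over binary partitions of each categorical variable's values) and take the overall best split.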
Stopping rules terminate the splitting process at each internal node. To determine the best tree, this implementation uses two sets of stopping rules. The first set is extremely conservative, resulting in an overly large tree, which usually lacks the generality necessary to account for data outside of the training set. To compensate, the second rule set forms a sequence of subtrees. Each tree is grown on a sizable fraction of the training data and tested on the remaining portion. This step is repeated until the tree has been grown and tested on all of the data. The stopping rules thus have access to cross-validated error rates for each subtree. The subtree with the lowest rates then defines the stopping points for each path in the full tree. Trees described below all represent cross-validated data.

The prediction rules work in a straightforward manner to add the necessary labels to the terminal nodes. For continuous variables, the rules calculate the mean of the data points classified together at that node. For categorical variables, the rules choose the class that occurs most frequently among the data points. The success of these rules can be measured through estimates of deviation. In this implementation, the deviation for continuous variables is the sum of the squared error for the observations. The deviation for categorical variables is simply the number of misclassified observations.

Results

In analyzing boundary locations in our data, we have two goals in mind. First, we want to discover the extent to which boundaries can be predicted, given information which can be generated automatically from the text of an utterance. Second, we want to learn how much predictive power can be gained by including additional sources of information which, at least currently, cannot be generated automatically from text. In discussing our results below, we compare predictions based upon automatically inferable information with those based upon hand-labeled data.

We employ four different sets of variables during the analysis. The first set includes observed phonological information about pitch accent and prior boundary location, as well as automatically obtainable information. The success rate of boundary prediction from this variable set is extremely high, with correct cross-validated classification of 3330 out of 3677 potential boundary sites -- an overall success rate of 90% (Figure 1). Furthermore, there are only five decision points in the tree. Thus, the tree represents a clean, simple model of phrase boundary prediction, assuming accurate phonological information.

Turning to the tree itself, we see that the ratio of current phrase length to prior phrase length is very important in boundary location. This variable alone (assuming that the boundary site occurs before the end of the utterance) permits correct classification of 2403 out of 2556 potential boundary sites. Occurrence of a phrase boundary thus appears extremely unlikely in cases where its presence would result in a phrase less than half the length of the preceding phrase. The first and last decision points in the tree are the most trivial. The first split indicates that utterances virtually always end with a boundary -- rather unsurprising news. The last split shows the importance of distance from the beginning of the utterance in boundary location; boundaries are more likely to occur when more than 2.5 seconds have elapsed from the start of the utterance.[3] The third node in the tree indicates that noun phrases form a tightly bound intonational unit. The fourth split in Figure 1 shows the role of accent context in determining phrase boundary location. If wi is not accented, then it is unlikely that a phrase boundary will occur after it.

[3] This fact may be idiosyncratic to our data, given that we observed a trend towards initial hesitations.

The significance of accenting in the phrase boundary classification tree leads to the question of whether or not predicted accents will have a similar impact on the paths of the tree. In the second analysis, we substituted predicted accent values for observed values. Interestingly, the success rate of the classification remained approximately the same, at 90%. However, the number of splits in the resultant tree increased to nine, and the tree failed to include the accenting of wi as a factor in the classification. A closer look at the accent predictions themselves reveals that the majority of misclassifications come from function words preceding a boundary. Although the accent prediction algorithm predicted that these words would be deaccented, they were in fact accented. This appears to be an idiosyncrasy of the corpus; such words generally occurred before relatively long pauses. Nevertheless, classification succeeds well in the absence of accent information, perhaps suggesting that accent values may themselves be highly correlated with other variables. For example, both pitch accent and boundary location appear sensitive to location of prior intonational boundaries and part-of-speech.
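The prediction and deviation rules described above are simple enough to state directly. The sketch below is our illustration (names and data are assumptions, not the implementation used in this study): a terminal node predicts the mean of its observations for a continuous response or the most frequent class for a categorical one, with deviation computed as summed squared error or as the misclassification count, respectively.

```python
from collections import Counter

def leaf_prediction_continuous(points):
    """Label a terminal node with the mean; deviation is squared error."""
    mean = sum(points) / len(points)
    deviation = sum((p - mean) ** 2 for p in points)
    return mean, deviation

def leaf_prediction_categorical(points):
    """Label a terminal node with the majority class; deviation is the
    number of observations the majority label misclassifies."""
    (label, count), = Counter(points).most_common(1)
    return label, len(points) - count

# e.g. a leaf holding 4 'boundary' and 1 'no-boundary' observations
label, dev = leaf_prediction_categorical(
    ['boundary', 'boundary', 'no-boundary', 'boundary', 'boundary'])
```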
In the third analysis, we eliminate the dynamic boundary percentage measure. The result remains nearly as good as before, with a success rate of 89%. The proposed decision tree confirms the usefulness of the observed accent status of wi in boundary prediction. By itself (again assuming that the potential boundary site occurs before the end of the utterance), this factor accounts for 1590 out of 1638 potential boundary site classifications. This analysis also confirms the strength of the intonational ties among the components of noun phrases. In this tree, 536 out of 606 potential boundary sites receive final classification from this feature.

We conclude our analysis by producing a classification tree that uses automatically-inferrable information alone. For this analysis we use predicted accent values instead of observed values and omit boundary distance percentage measures. Using binary-valued accent predictions (i.e., are <wi, wj> accented or not), we obtain a success rate for boundary prediction of 89%, and using a four-valued distinction for predicted accents (cliticized, deaccented, accented, 'NA') we increase this to 90%. The tree in Figure 2 presents the latter analysis.

Figure 2 contains more nodes than the trees discussed above; more variables are used to obtain a similar classification percentage. Note that accent predictions are used trivially, to indicate sentence-final boundaries (ra='NA'). In Figure 1, this function was performed by distance of the potential boundary site from the end of the utterance (et). The second split in the new tree does rely upon temporal distance -- this time, distance of the boundary site from the beginning of the utterance. Together these measurements correctly predict nearly forty percent of the data (38.2%). The classifier next uses a variable which has not appeared in earlier classifications -- the part-of-speech of wj. In Figure 2, in the majority of cases (88%) where wj is a function word other than 'to,' 'in,' or a conjunction (true for about half of potential boundary sites), a boundary does not occur. Part-of-speech of wi and type of constituent dominating wi but not wj are further used to classify these items. This portion of the classification is reminiscent of the notion of 'function word group' used commonly in assigning prosody in text-to-speech, in which phrases are defined, roughly, from one function word to the next. Overall rate of the utterance and type of utterance appear in the tree, in addition to part-of-speech and constituency information, and distance of the potential boundary site from the beginning and end of the utterance. In general, results of this first stage of analysis suggest -- encouragingly -- that there is considerable redundancy in the features predicting boundary location: when some features are unavailable, others can be used with similar rates of success.

Discussion

The application of CART techniques to the problem of predicting and detecting phrasing boundaries not only provides a classification procedure for predicting intonational boundaries from text, but it increases our understanding of the importance of several among the numerous variables which might plausibly be related to boundary location. In future, we plan to extend the set of variables for analysis to include counts of stressed syllables, automatic NP-detection (Church, 1988), and MUTUAL INFORMATION and GENERALIZED MUTUAL INFORMATION scores, which can serve as indicators of intonational phrase boundaries (Magerman and Marcus, 1990).

We will also examine possible interactions among the statistically important variables which have emerged from our initial study. CART techniques have worked extremely well at classifying phrase boundaries and indicating which of a set of potential variables appear most important. However, CART's step-wise treatment of variables, optimization heuristics, and dependency on binary splits obscure the possible relationships that exist among the various factors. Now that we have discovered a set of variables which do well at predicting intonational boundary location, we need to understand just how these variables interact.

References

Bengt Altenberg. 1987. Prosodic Patterns in Spoken English: Studies in the Correlation between Prosody and Grammar for Text-to-Speech Conversion, volume 76 of Lund Studies in English. Lund University Press, Lund.

J. Bachenko and E. Fitzpatrick. 1990. A computational grammar of discourse-neutral prosodic phrasing in English. Computational Linguistics. To appear.

Dwight Bolinger. 1989. Intonation and Its Uses: Melody in Grammar and Discourse. Edward Arnold, London.
Leo Brieman, Jerome H. Friedman, Richard A. Olshen, and Charles J. Stone. 1984. Classification and Regression Trees. Wadsworth & Brooks, Monterey CA.

K. W. Church. 1988. A stochastic parts program and noun phrase parser for unrestricted text. In Proceedings of the Second Conference on Applied Natural Language Processing, pages 136-143, Austin. Association for Computational Linguistics.

DARPA. 1990. Proceedings of the DARPA Speech and Natural Language Workshop, Hidden Valley PA, June.

J. P. Gee and F. Grosjean. 1983. Performance structures: A psycholinguistic and linguistic appraisal. Cognitive Psychology, 15:411-458.

D. M. Hindle. 1989. Acquiring disambiguation rules from text. In Proceedings of the 27th Annual Meeting, pages 118-125, Vancouver. Association for Computational Linguistics.

Julia Hirschberg. 1990. Assigning pitch accent in synthetic speech: The given/new distinction and deaccentability. In Proceedings of the Seventh National Conference, pages 952-957, Boston. American Association for Artificial Intelligence.

I. Lehiste. 1973. Phonetic disambiguation of syntactic ambiguity. Glossa, 7:197-222.

David M. Magerman and Mitchell P. Marcus. 1990. Parsing a natural language using mutual information statistics. In Proceedings of AAAI-90, pages 984-989. American Association for Artificial Intelligence.

Mitchell P. Marcus and Donald Hindle. 1985. A computational account of extra categorial elements in Japanese. In Papers presented at the First SDF Workshop in Japanese Syntax. System Development Foundation.

M. Ostendorf, P. Price, J. Bear, and C. W. Wightman. 1990. The use of relative duration in syntactic disambiguation. In Proceedings of the DARPA Speech and Natural Language Workshop. Morgan Kaufmann, June.

Janet B. Pierrehumbert. 1980. The Phonology and Phonetics of English Intonation. Ph.D. thesis, Massachusetts Institute of Technology, September.

Michael D. Riley. 1989. Some applications of tree-based modelling to speech and language. In Proceedings of the DARPA Speech and Natural Language Workshop, October.

E. Selkirk. 1984. Phonology and Syntax. MIT Press, Cambridge MA.

M. Steedman. 1990. Structure and intonation in spoken language understanding. In Proceedings of the 28th Annual Meeting of the Association for Computational Linguistics.

Appendix: Key to Figures

For each potential boundary <wi, wj>:

type    utterance type
tt      total # seconds in utterance
tw      total # words in utterance
st      distance (sec.) from start to wj
et      distance (sec.) from wj to end
sw      distance (words) from start to wj
ew      distance (words) from wj to end
la      is wi accented or not;
        or, cliticized, deaccented, accented
ra      is wj accented or not;
        or, cliticized, deaccented, accented
per     [distance (words) from last boundary] /
        [length (words) of last phrase]
tper    [distance (sec.) from last boundary] /
        [length (sec.) of last phrase]
j{1-4}  part-of-speech of the four words in the window
        around the potential boundary:
        v = verb        b = be-verb
        m = modifier    f = fn word
        n = noun        p = preposition
        w = WH
f{slr}  category of:
        s = smallest constit dominating wi, wj
        l = largest constit dominating wi, not wj
        r = largest constit dominating wj, not wi
        values: m = modifier, d = determiner,
        v = verb, p = preposition, w = WH,
        n = noun, s = sentence, f = fn word
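To make the variable key concrete, the per-site feature set can be encoded as a record. The class below is purely our illustration (the field names mirror the node labels above; the example values are invented):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class BoundarySite:
    """One potential boundary site <wi, wj>, following the appendix key."""
    type: str             # utterance type
    tt: float             # total seconds in utterance
    tw: int               # total words in utterance
    st: float             # seconds from start to wj
    et: float             # seconds from wj to end
    sw: int               # words from start to wj
    ew: int               # words from wj to end
    la: str               # accent status of wi (cliticized/deaccented/accented/NA)
    ra: str               # accent status of wj
    per: Optional[float]  # (words since last boundary) / (words in last phrase)
    tper: Optional[float] # (seconds since last boundary) / (seconds in last phrase)
    j1: str               # POS of the word before wi
    j2: str               # POS of wi
    j3: str               # POS of wj
    j4: str               # POS of the word after wj
    fs: str               # smallest constituent dominating wi and wj
    fl: str               # largest constituent dominating wi but not wj
    fr: str               # largest constituent dominating wj but not wi

# Invented example site between 'flights' and 'to' in a query
site = BoundarySite(type="wh-question", tt=5.2, tw=9, st=1.4, et=3.8,
                    sw=3, ew=6, la="accented", ra="deaccented",
                    per=1.5, tper=1.2, j1="v", j2="n", j3="p", j4="n",
                    fs="s", fl="n", fr="p")
```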
[Figure 1: decision tree not recoverable from text extraction.]

Figure 1: Predictions from Automatically-Acquired and Observed Data, 90%
[Figure 2: decision tree not recoverable from text extraction.]

Figure 2: Phrase Boundary Predictions from Automatically-Inferred Information, 90%
