
Content-Based Video Indexing and Retrieval

Current video management tools and techniques are based on pixels rather than perceived content. Thus, state-of-the-art video editing systems can easily manipulate such things as time codes and image frames, but they cannot know, for example, what a basketball is. Our research addresses four areas of content-based video management.

Stephen W. Smoliar and HongJiang Zhang, National University of Singapore


Video has become an important element of multimedia computing and communication environments, with applications as varied as broadcasting, education, publishing, and military intelligence. However, video will only become an effective part of everyday computing environments when we can use it with the same facility that we currently use text. Computer literacy today entails the ability to set our ideas down spontaneously with a word processor, perhaps while examining other text documents to develop those ideas and even using editing operations to transfer some of that text into our own compositions. Similar composition using video remains far in the future, even though workstations now come equipped with built-in video cameras and microphones, not to mention ports for connecting our increasingly popular handheld video cameras.

Why is this move to communication incorporating video still beyond our grasp? The problem is that video technology has developed thus far as a technology of images. Little has been done to help us use those images effectively. Thus, we can buy a camera that knows all about how to focus itself properly and even how to compensate for the fact that we can rarely hold it steady without a tripod. But no camera knows where the action is during a basketball game or a family reunion. A camera can give us a clear shot of the ball going through the basket, but only if we find the ball for it.

The point is that we do not use images just because they are steady or clearly focused. We use them for their content. If we wish to compose with images in the same way that we compose with words, we must focus our attention on content. Video composition should not entail thinking about image bits (pixels), any more than text composition requires thinking about ASCII character codes. Video content objects include basketballs, athletes, and hoops. Unfortunately, state-of-the-art software for manipulating video does not know about such objects. At best, it knows about time codes, individual frames, and clips of video and sound. To compose a video document, or even just incorporate video as part of a text document, we find ourselves thinking one way (with ideas) when we are working with text and another (with pixels) when we are working with video. The pieces do not fit together effectively, and video suffers for it. Similarly, if we wish to incorporate other text material in a document, word processing offers a powerful repertoire of techniques for finding what we want. In video, about the only technique we have is our own memory coupled with some intuition about how to use the fast forward and fast reverse buttons while viewing.

The moral of all this is that the effective use of video is still beyond our grasp because the effective use of its content is still beyond our grasp. How can we remedy this situation? At the Institute of Systems Science of the National University of Singapore, the Video Classification project addresses this question. We are currently tackling problems in four areas:
• Defining an architecture that characterizes the tasks of managing video content.

• Developing software tools and techniques that identify and represent video content.

• Applying knowledge representation techniques to the development of index construction and retrieval tools.

• Developing an environment for interacting with video objects.

In this article, we discuss each of these problem areas in detail, then briefly review a recent case study concerned with content analysis of news videos. We conclude with a discussion of our plans to extend our work into the audio domain.
Architecture for video management

Our architecture is based on the assumption that video information will be maintained in a database. This assumption requires us to define tools for the construction of such databases and the insertion of new material into existing databases. We can characterize these tools in terms of a sequence of specific task requirements:
• Parsing, which segments the video stream into generic clips. These clips are the elemental index units in the database. Ideally, the system decomposes individual images into semantic primitives. On the basis of these primitives, a video clip can be indexed with a semantic description using existing knowledge-representation techniques.

• Indexing, which tags video clips when the system inserts them into the database. The tag includes information based on a knowledge model that guides the classification according to the semantic primitives of the images. Indexing is thus driven by the image itself and any semantic descriptors provided by the model.

• Retrieval and browsing, where users can access the database through queries based on text and/or visual examples or browse it through interaction with displays of meaningful icons. Users can also browse the results of a retrieval query. It is important that both retrieval and browsing appeal to the user's visual intuition.

Figure 1 summarizes this task analysis as an architectural diagram. The heart of the system is a database management system containing the video and audio data from video source material that has been compressed wherever possible. The DBMS defines attributes and relations among these entities in terms of a frame-based approach to knowledge representation (described further under the subhead "A frame-based knowledge base"). This representation approach, in turn, drives the indexing of entities as they are added to the database. Those entities are initially extracted by the tools that support the parsing task. In the opposite direction, the database contents are made available by tools that support the processing of both specific queries and the more general needs of casual browsing. The next three sections discuss elements of this architecture in greater detail.
Video content parsing

Three tool sets address the parsing task. The first set segments the video source material into individual camera shots, which then serve as the basic units for indexing. The second set identifies different manifestations of camera technique in these clips. The third set applies content models to the identification of context-dependent semantic primitives.

Locating camera shot boundaries

We decided that the most viable segmentation criteria for motion video are those that detect boundaries between camera shots. Thus, the camera shot, consisting of one or more frames generated and recorded contiguously and representing a continuous action in time and space, becomes the smallest unit for indexing video. The simplest shot transition is a camera cut, where the boundary lies between two successive frames. More sophisticated transition techniques include dissolves, wipes, and fade-outs, all of which take place over a sequence of frames. In any case, camera shots can always be distinguished by significant qualitative differences. If we can express those differences by a suitable quantitative measure, then we can declare a segment boundary whenever that measure exceeds a given threshold. The key issues in locating shot boundaries, therefore, are selecting suitable difference measures and thresholds, and applying them to the comparison of video frames.

We now briefly review the segmentation techniques we currently employ. (For details, see Zhang et al.2) The most suitable measures rely on comparisons between the pixel-intensity histograms of two frames. The principle behind this metric is that two frames with little change in the background and object content will also differ little in their overall intensity distributions. Further strengthening this approach, it is easy to define a histogram that effectively accounts for color information. We also developed an automatic approach to detect the segmentation threshold on the basis of statistics of frame difference values and a multipass technique that improves processing speed.2
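As a rough illustration of this histogram comparison, the following Python sketch (using NumPy) computes normalized intensity histograms and flags camera cuts wherever the frame-to-frame difference exceeds a threshold. The bin count and the use of grey-level intensities are illustrative assumptions, not details of the system described here.

import numpy as np

def grey_histogram(frame, bins=64):
    """Normalized histogram of pixel intensities for one frame (H x W array of 0-255 values)."""
    hist, _ = np.histogram(frame, bins=bins, range=(0, 256))
    return hist / hist.sum()   # normalize so frame size does not matter

def histogram_difference(frame_a, frame_b, bins=64):
    """Sum of absolute bin-wise differences; large values suggest a shot change."""
    return float(np.abs(grey_histogram(frame_a, bins) - grey_histogram(frame_b, bins)).sum())

def find_cuts(frames, threshold):
    """Declare a camera cut wherever the consecutive-frame difference exceeds the threshold."""
    diffs = [histogram_difference(a, b) for a, b in zip(frames, frames[1:])]
    return [i + 1 for i, d in enumerate(diffs) if d > threshold], diffs

The same difference sequence returned by find_cuts is the kind of signal plotted in Figure 2 and consumed by the twin-comparison step described below.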

Figure 1. Diagram of video management architecture.

Figure 2. A sequence of frame-to-frame histogram differences obtained from a documentary video, where differences corresponding both to camera breaks and to transitions implemented by special effects can be observed.

Figure 2 illustrates a typical sequence of difference values. The graph exhibits two high pulses corresponding to two camera breaks. It also illustrates a gradual transition occurring over a sequence of frames. In this case, the task is to identify the sequence start and end points. As the inset in Figure 2 shows, the difference values during such a transition are far less than across a camera break. Thus, a single threshold lacks the power to detect gradual transitions. A so-called twin-comparison approach solves this problem. The name refers to the use of two thresholds. First, a reduced threshold detects the potential starting frame of a transition sequence. Once that frame has been identified, it is compared against successive frames, thus measuring an accumulated difference instead of frame-to-frame differences. This accumulated difference must be monotonic. When it ceases to be monotonic, it is compared against a second, higher threshold. If this threshold is exceeded, we conclude that the monotonically increasing sequence of accumulated differences corresponds to a gradual transition. Experiments have shown this approach to be very effective.2
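A minimal sketch of this twin-comparison logic follows, assuming a generic frame-difference function such as the histogram measure above. The two thresholds are placeholders, and the sketch is a simplification rather than the implementation evaluated in our experiments.

def twin_comparison(frames, diff, t_low, t_high):
    """Sketch of the twin-comparison approach for gradual transitions.

    frames - sequence of frames
    diff   - function measuring the difference between two frames (e.g., histogram_difference)
    t_low  - reduced threshold that flags a potential transition start
    t_high - higher threshold that the accumulated difference must exceed
    Returns a list of (start_frame, end_frame) index pairs.  Abrupt cuts also
    satisfy the test, appearing as one-frame "transitions".
    """
    transitions, i = [], 0
    while i < len(frames) - 1:
        if diff(frames[i], frames[i + 1]) >= t_low:          # candidate start frame
            start, j = i, i + 1
            accumulated = diff(frames[start], frames[j])
            # follow the accumulated difference while it keeps increasing monotonically
            while j + 1 < len(frames) and diff(frames[start], frames[j + 1]) > accumulated:
                j += 1
                accumulated = diff(frames[start], frames[j])
            if accumulated >= t_high:                         # gradual transition confirmed
                transitions.append((start, j))
            i = j
        i += 1
    return transitions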
Shot classification

Before a system can parse content, it must first recognize and account for artifacts caused by camera movement. These movements include panning and tilting (horizontal or vertical rotation of the camera) and zooming (focal length change), in which the camera position does not change, as well as tracking and booming (horizontal and vertical transverse movement of the camera) and dollying (horizontal lateral movement of the camera), in which the camera position does change.3 These operations may also occur in combinations. They are most readily detected through motion field analysis, since each operation has its own characteristic pattern of motion vectors. For example, a zoom causes most of the motion vectors to point either toward or away from a focus center, while movement of the camera itself shows up as a modal value across the entire motion field. The motion vectors can be computed by the block-matching algorithms used in motion compensation for video compression. Thus, a system can often retrieve the vectors from files of video compressed according to standards such as MPEG and H.261. The system could also compute them in real time by using chips that perform such compression in hardware.
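A crude sketch of such a motion-field classification appears below. The ratio thresholds and the two-way zoom/pan decision rule are illustrative assumptions; the actual detection uses richer models of each camera operation.

import numpy as np

def classify_camera_operation(vectors, zoom_ratio=0.7, pan_ratio=0.7):
    """Classify a motion-vector field as 'zoom', 'pan', or 'other'.

    vectors - array of shape (H, W, 2): one (dx, dy) vector per block
    The ratio thresholds are illustrative, not values taken from this article.
    """
    h, w, _ = vectors.shape
    ys, xs = np.mgrid[0:h, 0:w]
    centre = np.stack([xs - (w - 1) / 2.0, ys - (h - 1) / 2.0], axis=-1)

    moving = np.linalg.norm(vectors, axis=-1) > 0
    if moving.any():
        # Zoom: vectors point consistently toward or away from the focus centre,
        # so the radial components share one sign almost everywhere.
        radial = np.sum(vectors * centre, axis=-1)
        outward = np.mean(radial[moving] > 0)
        if max(outward, 1 - outward) > zoom_ratio:
            return "zoom"

        # Pan/track: a single modal displacement dominates the whole field.
        mean_v = vectors[moving].mean(axis=0)
        agree = np.sum(vectors[moving] * mean_v, axis=-1) > 0
        if agree.mean() > pan_ratio:
            return "pan"
    return "other"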

Content models

Content parsing is most effective with an a priori model of a video's structure. Such a model can represent a strong spatial order within the individual frames of shots and/or a strong temporal order across a sequence of shots. News broadcasts usually provide simple examples of such models. For example, all shots of the anchorperson conform to a common spatial layout, and the temporal structure simply alternates between the anchorperson and more detailed footage (possibly including breaks for commercials). Our approach to content parsing begins with identifying key features of the image data, which are then compared to domain models to identify objects inferred to be part of the domain. We then identify domain events as segments that include specific domain objects. Our initial experiments involve models for cut boundaries, typed shots, and episodes. The cut boundary model drives the segmentation process that locates camera shot boundaries. Once a shot has been isolated through segmentation, it can be compared against type models based both on features to be detected and on measures that determine acceptable similarity. Sequences of typed shots can then be similarly compared against episode models. We discuss this in more detail later, under "Case study of video content analysis."
Index construction and retrieval tools

The fundamental task of any database system is to support retrieval, so we must consider how to build indexes that facilitate such retrieval services for video. We want to base the index on semantic properties, rather than lower level features. A knowledge model can support such semantic properties. The model for our system is a frame-based knowledge base. In the following discussion, the word frame refers to such a knowledge-base object rather than a video image frame.
A frame-based knowledge base

An index based on semantic properties requires an organization that explicitly represents the various subject matter categories of the material being indexed. Such a representation is often realized as a semantic network, but text indexes tend to be structured as trees (as revealed by the indented representations of most book indexes). We decided that the more restricted tree form also suited our purposes. Figure 3 gives an example of such a tree. It represents a selection of topical categories taken from a documentary video about the Faculty of Engineering at the National University of Singapore. The tree structure represents relations of specialization and generalization among these categories. Note, in particular, that categories correspond both to content material about student activities (Activity) and to classifications of different approaches to producing the video (Video-Types). Users tend to classify material on the basis of the information they hope to extract. This particular set of categories reflects interest both in the faculty and in documentary production. Thus, the purpose of this topical organization is not to classify every object in the video definitively. Rather, it helps users who approach this material with only a general set of questions, orienting them in how to formulate more specific questions and what sorts of answers to expect.

The frame-based knowledge base is the most appropriate technology for building such a structure. The frame is a data object that plays a role similar to that of a record in a traditional database. However, frames are grouped into classes, each of which represents some topical category. As Figure 3 illustrates, these classes tend to be organized in a specialization hierarchy. Such a hierarchy allows the representation of content in terms of one or more systems of categories that can then be used to focus attention for a variety of tasks. The simplest of these tasks is the casual browsing of collections of items. However, hierarchical organization also facilitates the retrieval of specific items that satisfy the sorts of constraints normally associated with a database query. Like the records of a database, frames are structured as a collection of fields (usually called slots in frame-based systems). These slots provide different elements of descriptive information, and the elements distinguish the topical characteristics for each object represented by a frame. It is important to recognize that we use frames to represent both classes (the categories) and instances (the elements categorized). As an example of a class frame, consider the Laboratory category in Figure 3. We might define the frame for it as shown in Figure 4a. Alternatively, we can define an instance of one of its subclasses in a similar manner, as shown in Figure 4b. Note that not all slots need to be filled in a class definition (void indicates an unfilled slot), while they do all tend to be filled in instances.

Figure 3. A tree structure of topical categories for a documentary video about engineering at the National University of Singapore.

Name: Laboratory
Superclass: Academic
Subclasses: #table[Computer-Lab Electronic-Lab Mechanical-Lab Civil-Lab Chemical-Lab]
Instances: void
Description: void
Video: void
Course: void
Equipment: void

Name: Wave-Simulator
Class: Civil-Lab
Description: Monitoring pressure variation in breaking waves.
Video: WaveBreaker-CoverFrame
Course: Civil-Eng
Equipment: #table[Computer ...]

Figure 4. Examples of class frame Laboratory (top) and subclass instance Wave-Simulator (bottom).

Also note that a slot can be filled by either a single value or a collection of values (indicated by the #table[...] construct). For purposes of search, it is also important to note that some slots, such as Name, Superclass, Subclasses, Instances, and Class, exist strictly for purposes of maintaining a system of frames. The remaining slots, such as Description, Video, Course, and Equipment, are responsible for the actual representation of content. These latter slots are thus the objective of all search tasks.

Most frame-based knowledge bases impose no restrictions on the contents of slots: any slot can assume any value or set of values. However, the search objective can be facilitated by strongly typing all slots. The system could enforce such a constraint through an if-added demon that does not allow a value to be added to a slot unless it satisfies some data typing requirement. For example, if Shot is a class whose instances represent individual camera shots from a video source, then only values that are instances of the Shot class can be added to the Video slot in frames such as those in Figure 4. Data typing can determine whether or not any potential slot value is a frame, and it might even be able to distinguish class frames from instance frames. However, we can make typing even more powerful if we extend it to deal with classes as if they were data types. In this case, type checking would verify not only that every potential Video slot value is an instance frame but, more specifically, that it is an instance of the Shot class. Furthermore, we could subject slot values for instances of more specific classes to even more restrictive constraints. Thus, we might constrain the Video slot of the Headings frame to check whether or not the content of a representative frame of the Shot instance being assigned consists only of characters. (We could further refine this test if we knew the fonts used to compose such headings.) What is important for retrieval purposes is that we can translate knowledge of a slot's type into knowledge of how to search it. We can apply different techniques to inspecting the contents of different slots, and we can combine those techniques by means far more sophisticated than the sorts of combinations normally associated with database query operations.
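The flavor of such typed slots can be sketched in a few lines of Python. The class and slot names below are hypothetical, and the isinstance test stands in for the if-added demon described above; this is an illustration, not our knowledge-base implementation.

class Frame:
    """Minimal frame object: a named collection of slots, some with declared types."""
    def __init__(self, name, is_class=False, slot_types=None):
        self.name = name
        self.is_class = is_class
        self.slot_types = slot_types or {}   # slot name -> required class/type
        self.slots = {}

    def add_slot_value(self, slot, value):
        """'If-added' style check: reject a value that violates the slot's declared type."""
        required = self.slot_types.get(slot)
        if required is not None and not isinstance(value, required):
            raise TypeError(f"slot '{slot}' of {self.name} requires {required.__name__}")
        self.slots.setdefault(slot, []).append(value)


class Shot(Frame):
    """Instance frames representing individual camera shots."""
    def __init__(self, name, start, end):
        super().__init__(name)
        self.slots.update({"Start": [start], "End": [end]})


# Usage: restrict the Video slot of an instance frame to Shot objects (names are hypothetical).
wave_sim = Frame("Wave-Simulator", slot_types={"Video": Shot})
wave_sim.add_slot_value("Video", Shot("WaveBreaker-CoverFrame", 120, 480))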


Retrieval tools

Let us now consider more specifically how we can search frames given a priori knowledge of the typing of their slots. Because a database is only as good as the retrieval facilities it supports, it must have a variety of tools, based on both text and visual interfaces. Our system's current suite of tools includes a free-text query engine and interface, the tree display of the class hierarchy, image feature-based retrieval tools, and the Clipmap.

Every frame in the knowledge base includes a Description slot with a text string as its contents. Thus, the user can provide text descriptions for all video shots in the database. The free-text retrieval tool retrieves video shots on the basis of the Description slot contents. A concept-based retrieval engine analyzes the user's query.6 Given a free-text query specified by the user, the system first extracts the relevant terms by removing the nonfunctional words and converting those remaining into stemmed forms. The system then checks the query against a domain-specific thesaurus, after which it uses similarity measures to compare the text descriptions with the query terms. Frames whose similarity measure exceeds a given threshold are identified and retrieved, linearly ordered by the strength of the similarity.

In addition to using free text, we can formulate queries directly on the basis of the category tree itself. This tree is particularly useful in identifying all shots that are instances of a common category at any level of generalization. We can then use the tree to browse instances of related categories. The class hierarchy also allows for slot-based retrieval. Free-text retrieval provides access to Description slots, but we can search on the basis of other slots as well. For example, we can retrieve slots whose contents are other frames through queries based on their situation in the class hierarchy. We can compare slots having numeric values against numeric intervals. Furthermore, if we want to restrict a search to the instances of a particular class, then the class hierarchy can tell us which slots can be searched and what types of data they contain.
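A minimal sketch of this free-text matching follows, with a toy stopword list, crude suffix stripping in place of real stemming, and no thesaurus lookup; all of these are simplifying assumptions rather than the concept-based engine itself.

import re
from collections import Counter
from math import sqrt

STOPWORDS = {"the", "a", "an", "of", "in", "on", "and", "or", "with", "for", "to"}

def terms(text):
    """Very rough term extraction: lowercase, drop stopwords, strip a trailing 's'."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w.rstrip("s") for w in words if w not in STOPWORDS)

def similarity(query, description):
    """Cosine similarity between term-frequency vectors of query and description."""
    q, d = terms(query), terms(description)
    dot = sum(q[t] * d[t] for t in q)
    norm = sqrt(sum(v * v for v in q.values())) * sqrt(sum(v * v for v in d.values()))
    return dot / norm if norm else 0.0

def retrieve(query, descriptions, threshold=0.1):
    """Rank Description strings by similarity to the query; return (index, score) pairs."""
    scored = sorted(((similarity(query, d), i) for i, d in enumerate(descriptions)), reverse=True)
    return [(i, s) for s, i in scored if s > threshold]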

Retrieval based on the contents of Video slots will require computation of characteristic visual features. As an example, a user examining a video of a dance performance should be able to retrieve all shots of a particular dancer on the basis of costume color. Retrieval would then require constructing a model for matching regions of the color with suitable spatial properties. The primitives from which such models are constructed then serve as the basis for the index structure. In such a database, each video clip would be represented by one or more frames, and all indexing and retrieval would be based on the image features of those frames. Some image database systems, such as the Query By Image Content (QBIC) project,7 have developed techniques that support this approach. These techniques include selection and computation of image features that provide useful query functionality, similarity-based retrieval methods, and interfaces that let users pose and refine queries visually and navigate their way through the database visually.

We chose color, texture, and shape as basic image features and developed a prototype system with fast image-indexing abilities. This system automatically computes numerical index keys based on color distribution, prominent color region segmentation, and color histograms (as texture models) for each image. Each image is indexed by the size, color, location, and shape of segmented regions and the color histograms of the entire image and nine subregions. To achieve fast retrieval, the system codes these image features into numerical index keys according to the significance of each feature in the query-matching process. This retrieval approach has proved fast and accurate.

Indexing representative images essentially ignores the temporal nature of a video. Retrieval should be based on events as well as features of static images. This will require a better understanding of which temporal visual features are both important for retrieval and feasible to compute. For instance, we can retrieve zooming sequences through a relatively straightforward examination of the motion vector field. However, because such vector fields are often difficult to compute (and because the motion vectors provided by compressed video are not always a reliable representation of optical flow), a more viable alternative might be to perform feature analysis on spatio-temporal images. We discuss this alternative below under the subsection "Micons: Icons for video content."

A Clipmap is simply a window containing a collection of icons, each of which represents a camera shot. We can use Clipmaps to provide an unstructured index for a collection of shots. They can also be used to display the results of retrieval operations.
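The histogram portion of these index keys can be sketched roughly as follows. The bin counts, the 3 x 3 arrangement of the nine subregions, and the L1 comparison are illustrative choices, not the exact coding scheme of the prototype.

import numpy as np

def color_histogram(region, bins=8):
    """Joint RGB histogram with `bins` levels per channel, normalized to sum to 1."""
    hist, _ = np.histogramdd(region.reshape(-1, 3), bins=(bins,) * 3, range=((0, 256),) * 3)
    return hist.ravel() / hist.sum()

def image_index(image, bins=8):
    """Index keys: one histogram for the whole image plus nine for a 3 x 3 grid of subregions."""
    h, w, _ = image.shape
    keys = [color_histogram(image, bins)]
    for i in range(3):
        for j in range(3):
            sub = image[i * h // 3:(i + 1) * h // 3, j * w // 3:(j + 1) * w // 3]
            keys.append(color_histogram(sub, bins))
    return np.stack(keys)   # shape (10, bins ** 3)

def index_distance(keys_a, keys_b):
    """L1 distance between two sets of keys; smaller values mean more similar images."""
    return float(np.abs(keys_a - keys_b).sum())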

For example, rather than simply listing the frames retrieved by a free-text query, the system can construct a Clipmap based on the contents of the Video slot of each frame. Such a display is especially useful when the query results in a long list of frames. For example, Figure 5 is a Clipmap constructed for a query requesting all instances of the Activity class. Even if the system orders retrieval results by degree of similarity (as it does in free-text search), it can still be difficult to identify the desired shots from text representations of those frames. The Clipmap provides visual recognition as an alternative to examining such text descriptions.
Interactive video objects

Figure 5. A typical Clipmap.

We turn now to the problem of interfaces. Video is media rich, providing moving pictures, text, music, and sound. Thus, interfaces based on keywords or other types of text representation cannot provide users a suitable window on video content. Only visual representation can provide an intuitive cue to such content. Furthermore, we should not regard such cues as passive objects. A user should be able to interact with them, just as text indexes are more for interaction than for examination. In this section, we discuss three approaches to interactivity.

Figure 6. An environment for examining and manipulating micons.

Micons: Icons for video content

A commonly used visual representation of video shots or sequences is the video or movie icon, sometimes called a micon.8 Figure 6 illustrates the environment we have designed for examining and manipulating micons. Every video clip has a representative frame that provides a visual cue to its content. This static image is displayed in the upper index strip of Figure 6. Selecting an image from the index strip causes the system to display the entire micon in the left-hand window.

Figure 7. A micon with a horizontal slice. Each colored line matches the color of a leotard on the lower leg. Horizontal movement of the line in the image exposed by the slice corresponds to horizontal movement of the leg (and usually the dancer). Dancers' movements can be traced by using this exposed spatio-temporal picture as a source of cues.

It also brings up a display of the contents of the Description slot and loads the clip into the soft video player shown on the right. The depth of the micon itself corresponds to the duration of the represented video sequence, and we can use this depth dimension as a scroll bar. Thus, we can use any point along a depth face of the icon to cue the frame corresponding to that point in time for display on the front face. The top and side planes of the icon are the spatio-temporal pictures composed by the pixels along the horizontal and vertical edges of each frame in the video clip.4 This presentation reveals that, at the bit level, a video clip is best thought of as a volume of pixels, different views of which can provide valuable content information. (For example, the ascent of Apollo 11 represented in Figure 6 is captured by the upper face of the icon.) We can also incorporate the camera operation information into the video icon construction and build a VideoSpaceIcon.9

We can examine this volume further by taking horizontal and vertical slices, as indicated by the operator icons on the side of the display in Figure 6. For example, Figure 7 illustrates a horizontal slice through a micon corresponding to an excerpt from Changing Steps, produced by Elliot Caplan and Merce Cunningham, a video interpretation of Cunningham's dance of the same name. (This micon actually does not correspond to a single camera shot. The clip was obtained from the Hierarchical Video Magnifier, discussed in the next subsection.) Note that the slice was taken just above the ankles, so it is possible to trace the movements of the dancers in time through the colored lines created as traces of their leotards.

Selecting a representative frame for each camera shot in the index strip is an important issue. Currently, we avoid tools that are too computationally intensive. Two approaches involve simple pixel-based properties. In the first, an average frame is defined in which each pixel has the average of the values at the same grid point in all frames of the shot; the system then selects the frame most similar to this average frame as the representative frame. The second approach averages the histograms of all the frames in a clip and selects the frame whose histogram is closest to the average histogram as the representative frame. However, neither of these approaches involves semantic properties (although the user can always override decisions made by these methods). We also plan to incorporate camera and object motion information either for selecting a representative frame or for constructing a salient still10 instead of a representative frame.
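The second, histogram-based selection method can be sketched as follows; the bin count is an arbitrary choice, and the sketch works on grey-level histograms only.

import numpy as np

def representative_frame(frames, bins=64):
    """Pick the frame whose intensity histogram is closest to the shot's average histogram."""
    hists = []
    for f in frames:
        h, _ = np.histogram(f, bins=bins, range=(0, 256))
        hists.append(h / h.sum())
    mean_hist = np.mean(hists, axis=0)
    distances = [np.abs(h - mean_hist).sum() for h in hists]
    return int(np.argmin(distances))   # index of the representative frame within the shot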

Figure 8. A hierarchical browser of a full-length video.
Hierarchical video magnifier

Sometimes the ability to browse a video in its entirety is more important than examining individual camera shots in detail. We base our approach on the Hierarchical Video Magnifier.11 It is illustrated in Figure 8, which presents an overview of the entire Changing Steps video. The original tape of this composition was converted to a QuickTime movie 1,282,602 units long. (There are 600 QuickTime units per second, so this corresponds to a little under 36 minutes.) As the figure shows, the display dimensions allow for five frames side by side. Therefore, the whole movie is divided into five segments of equal length, each segment represented by the frame at its midpoint. As an example from this particular video, the first segment occupies the first 256,520 units of the movie, and its representative frame is at index 128,260. Each segment can then be similarly expanded by dividing it into five portions of equal length, each represented by the midpoint frame. By the time we get to the third level, we are viewing five equally spaced frames from a segment of 51,304 units (approximately 85.5 seconds).

Users can continue browsing to greater depth, after which the screen scrolls accordingly. The user can also select any frame on the display for storage. The system will store the entire segment represented by the frame as a separate file, which the user can then examine with the micon viewer. (This is how we created the image in Figure 7.) This approach to browsing is particularly valuable for a source like Changing Steps, which does not have a well-defined narrative structure. It can serve equally well for material where the narrative structure is not yet understood.

The Hierarchical Video Magnifier is an excellent example of content-free content analysis. The technique requires no information regarding the content of the video other than its duration. We extended it to exploit the results of automatic segmentation. The segment boundaries determined by simple arithmetic division in the Hierarchical Video Magnifier are adjusted by shifting them to the nearest camera shot boundary. Thus, at the top levels of the hierarchy, the segments actually correspond to sequences of camera shots rather than to arbitrary intervals of a fixed duration. These camera shot boundaries are honored in the subdivision of all segments that consist of more than a single such shot. When a segment contains only one shot, the simple arithmetic division of the Hierarchical Video Magnifier is restored in constructing all subsequent levels of the hierarchy.
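A sketch of this boundary adjustment, assuming the sorted list of camera shot boundaries produced by segmentation is available; the snap-to-nearest-boundary rule is the only part taken from the description above, and everything else is illustrative.

def magnifier_segments(start, end, shot_boundaries, parts=5):
    """Divide [start, end) into equal parts, then snap each cut to the nearest shot boundary.

    start, end      - segment limits in time units (e.g., QuickTime units)
    shot_boundaries - sorted list of camera shot boundary positions from segmentation
    parts           - number of sub-segments displayed per level (five in Figure 8)
    Returns (segment_start, segment_end, representative_index) triples.
    """
    inside = [b for b in shot_boundaries if start < b < end]
    cuts = [start + i * (end - start) // parts for i in range(1, parts)]
    if inside:   # honour shot boundaries when the segment spans more than one shot
        cuts = [min(inside, key=lambda b: abs(b - c)) for c in cuts]
    segments, prev = [], start
    for c in sorted(set(cuts)) + [end]:
        if c > prev:
            segments.append((prev, c, (prev + c) // 2))   # midpoint frame represents the segment
        prev = c
    return segments

With an empty boundary list, this reproduces the plain arithmetic division described above: at the top level of Figure 8, the first segment is (0, 256520) with representative index 128,260.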

Figure 9. The temporal structure of a typical news program: anchorperson shots alternating with news shots, commercial break shots, and weather forecast shots.
Clipmaps

In addition to providing a useful interface for the results of retrieval queries, Clipmaps can also serve as an interactive tool for index construction. In this capacity, the Clipmap plays a role in examining camera shots similar to that of a light table in examining photographic slides. Such a display is very useful in manually sorting the video segments into different categories. It works because the user can maintain several open Clipmap windows. It is thus possible to start with a Clipmap window that is a totally unstructured collection and group segments from a common category into a separate Clipmap. Thus, this feature can be used to form categories by the divide and conquer technique of breaking down a large pile of video icons into smaller piles. Furthermore, the groups created by this process then define the topology of a class hierarchy, such as the one illustrated in Figure 3. While no system is yet sophisticated enough to generate labels or descriptions for these classes automatically, the user can be prompted for such information while seeing a display of the Clipmap corresponding to the class that needs labeling.

Case study of video content analysis


We took a case study approach to validating the tools and techniques discussed in this article. Many of our best results to date have come from analyses of television news programs. As pointed out earlier, content parsing is most feasible when we have an a priori model of a video's structure based on domain knowledge. Such model definition is comparatively easy for news broadcasts. For example, Figure 9 provides a straightforward representation of the temporal structure of a news video.12 It shows a simple sequence of news items (possibly interleaved with commercials), each of which may include an anchorperson shot at its beginning and/or end.

As a rule, it is not easy to classify individual news shots by structural properties, with the possible exception of certain regular features, such as weather, sports, and business. On the other hand, frames of anchorperson shots have a well-defined spatial structure, which can be distinguished from frames of other news shots (see Figure 10). Additionally, a news item in most news programs always starts with an anchorperson shot, followed by a sequence of shots illustrating the news story. Parsing thus relies on classifying each shot according to such temporal and spatial structures.

Our approach to news video content parsing begins with identifying key features of the shots, which are then compared to domain models to identify objects inferred to be part of the domain. Thus, we break news program parsing into three tasks. The first task defines an anchorperson shot model that incorporates both the temporal structure of the shot and the spatial structure of a representative frame. The second task develops similarity measures to be used in matching these models with a given shot as a means of deciding whether that shot is an anchorperson shot. The third task uses a temporal structure model of the entire news program to finalize the shot classification. We developed a set of algorithms that locates anchorperson shots based on the spatial and temporal features of the shots. The system then compares sequences of typed shots to episode models. The algorithms have proved very effective and achieve high accuracy in news video parsing.12

We applied the two index schemes discussed earlier, text and visual, to the news programs. The text index uses the topical category tree and assigns news items to classes corresponding to different news topics. The free-text tool can retrieve these news items. However, although we can predefine the category tree structure, we have to insert each news item manually into the tree, which can be a time-consuming and tedious task.

The visual index is composed automatically from the parsing processes. We represent each news item visually by a micon in a Clipmap. The cover frames are anchorperson frames containing a news icon, which provides a visual cue to the content of the news item. If there is no anchorperson frame containing a news icon, then the cover frame is the first frame of the first news shot following the anchorperson shot. All icons of the news items belonging to a news program are then presented in a common Clipmap. Currently, we digitize, compress, and save the video data of each news item as a QuickTime file, together with the soundtrack.
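One way to sketch the first and third of these tasks in code is shown below. The histogram-intersection test against stored anchorperson-frame histograms and the minimum shot length are illustrative stand-ins for the spatial and temporal features of the actual shot model, and the thresholds are assumptions.

import numpy as np

def is_anchorperson_shot(shot_frames, model_histograms, sim_threshold=0.85, min_frames=75):
    """Decide whether a shot matches an anchorperson model.

    shot_frames      - list of frames (arrays) in the shot
    model_histograms - normalized histograms of known anchorperson frame layouts
    The similarity measure, thresholds, and minimum length are illustrative only.
    """
    if len(shot_frames) < min_frames:           # anchorperson shots tend to be relatively long
        return False
    rep = shot_frames[len(shot_frames) // 2]    # crude representative frame: the middle one
    h, _ = np.histogram(rep, bins=64, range=(0, 256))
    h = h / h.sum()
    return any(np.minimum(h, m).sum() >= sim_threshold for m in model_histograms)

def group_news_items(shots, model_histograms):
    """Apply the temporal model: each news item starts with an anchorperson shot."""
    items, current = [], []
    for shot in shots:
        if is_anchorperson_shot(shot, model_histograms):
            if current:
                items.append(current)
            current = [shot]                    # an anchorperson shot opens a new item
        elif current:
            current.append(shot)
    if current:
        items.append(current)
    return items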

A lower level of visual index is provided for each news item. That is, each shot of a news item, segmented during the video parsing process, can be represented as a micon, and we represent each news item in its own Clipmap. This allows direct access to each shot. A user can select an icon and activate a browsing tool to view the contents of a news item, the entire program, or a camera shot within a news item.


Audio analysis: A new frontier

As any filmmaker knows, the audio track provides a very rich source of information to supplement our understanding of any video. We can also use this information in video segmentation and indexing tasks. For instance, significant changes in spectral content can serve as segment boundary cues. Furthermore, we assume that we can decompose our auditory perceptions into objects, just as we can our visual perceptions. We currently know far less about just what such audio objects are than we do about corresponding visual objects.14 However, if we can characterize them, we should be able to track them across an auditory stream. This technique should provide useful information for segmentation and indexing. Therefore, effective analysis of audio signals and their integration with information obtained from image analysis is an important part of our work.

We have begun to develop algorithms that detect content changes in an audio signal, but the task is challenging. We are investigating changes in both time and frequency domain representations as sources of criteria for segment boundaries. We also plan to develop models of audio events, similar to the models used in image-based content parsing.13 For example, in a sports video, very loud shouting followed by a long whistle might indicate that someone has scored a goal, in which case the system should recognize an event. We also anticipate incorporating spectral signatures into these models.

Voices of speakers are a key source of content information. Most important here is the problem of identifying different speakers from the audio signal. Although considerable research has addressed this problem, most results are inadequate in the presence of other nonspeech signals. Thus, further study is necessary. Once a speaker's voice has been detected and isolated, it should then also be possible to apply speech-to-text techniques as an additional source of content information.
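A first, illustrative sketch of such a spectral-change detector appears below. The window length, the normalized L1 spectral difference, and the threshold are placeholder assumptions, not the algorithms under development.

import numpy as np

def spectral_change_points(samples, rate, frame_ms=20, threshold=0.4):
    """Flag points where the short-time spectrum changes sharply (candidate segment cues).

    samples   - 1-D array of audio samples
    rate      - sampling rate in Hz
    frame_ms  - analysis window length in milliseconds
    threshold - normalized spectral difference above which a change is reported
    """
    frame_len = int(rate * frame_ms / 1000)
    spectra = []
    for i in range(len(samples) // frame_len):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        mag = np.abs(np.fft.rfft(frame * np.hanning(frame_len)))
        spectra.append(mag / (mag.sum() + 1e-9))      # normalized magnitude spectrum
    changes = []
    for i in range(1, len(spectra)):
        if np.abs(spectra[i] - spectra[i - 1]).sum() > threshold:
            changes.append(i * frame_len / rate)      # time of the change in seconds
    return changes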

Figure 10. The spatial structure of a frame from an anchorperson shot.

Acknowledgments
Images from the video Changing Steps were reproduced with the kind permission of the Cunningham Dance Foundation. Singapore Broadcasting Corp. gave us permission to reproduce an image from one of their news broadcasts (Figure 10). Also, many programmers contributed to implementing the systems discussed in this article. For these efforts we wish to thank You Hong Tan, Teng Soon Wee, Jian Hua Wu, Siew Lian Koh, Wee Kuan Khow, Kwang Meng Tan, and Chin Kai Ong.

References
1. D. Swanberg, C.-F. Shu, and R. Jain, "Knowledge Guided Parsing in Video Databases," Proc. IS&T/SPIE Symp. Electronic Imaging: Science and Technology, SPIE, Bellingham, Wash., 1993.
2. H.J. Zhang, A. Kankanhalli, and S.W. Smoliar, "Automatic Partitioning of Video," Multimedia Systems, Vol. 1, No. 1, 1993, pp. 10-28.
3. A. Nagasaka and Y. Tanaka, "Automatic Video Indexing and Full-Video Search for Object Appearances," in Visual Database Systems, Vol. II, E. Knuth and L.M. Wegner, eds., Elsevier, Amsterdam, 1992, pp. 113-127.
4. A. Akutsu et al., "Video Indexing Using Motion Vectors," Proc. SPIE Visual Comm. and Image Processing 92, SPIE, Bellingham, Wash., 1992, pp. 1,522-1,530.
5. H.J. Zhang and S.W. Smoliar, "Developing Power Tools for Video Indexing and Retrieval," to appear in Proc. IS&T/SPIE Symp. Electronic Imaging: Science and Technology, 1994.
6. G. Salton and M. McGill, Introduction to Modern Information Retrieval, McGraw-Hill, New York, 1983.
7. W. Niblack et al., "The QBIC Project: Querying Images By Content Using Color, Texture and Shape," Proc. IS&T/SPIE Symp. Electronic Imaging: Science and Technology, SPIE, Bellingham, Wash., 1993.
8. Y. Tonomura, "Video Handling Based on Structured Information for Hypermedia Systems," Proc. Int'l Conf. Multimedia Information Systems, McGraw-Hill, Singapore, 1991, pp. 333-344.
9. Y. Tonomura et al., "VideoMAP and VideoSpaceIcon: Tools for Anatomizing Video Content," Proc. InterCHI 93, ACM Press, New York, 1993, pp. 131-136.
10. L. Teodosio and W. Bender, "Salient Video Stills: Content and Context Preserved," Proc. ACM Multimedia 93, ACM Press, New York, 1993, pp. 39-46.
11. M. Mills, J. Cohen, and Y.Y. Wong, "A Magnifier Tool for Video Data," Proc. CHI 92, ACM Press, New York, 1992, pp. 93-98.
12. H.J. Zhang et al., "Automatic Parsing of News Video," to appear in Proc. IEEE Int'l Conf. Multimedia Computing and Systems, IEEE Computer Society Press, Los Alamitos, Calif., 1994.
13. M.J. Hawley, Structure Out of Sound, doctoral dissertation, Massachusetts Institute of Technology, Cambridge, 1993.
14. S.W. Smoliar, "Classifying Everyday Sounds in Video Annotation," in Multimedia Modeling, T.-S. Chua and T.L. Kunii, eds., World Scientific, Singapore, 1993, pp. 309-313.

Stephen W. Smoliar joined the Institute of Systems Science in 1991, where he currently leads a project on video classification. He also leads a project on using information technology as a teaching medium in music schools. His research interests are in knowledge representation, perceptual categorization, cognitive models, and the application of artificial intelligence to advanced media technologies. Smoliar received his PhD in applied mathematics and his BSc in mathematics from MIT. He also has an extensive background in music and is the sole member of the Society for Music Theory in Singapore.

HongJiang Zhang joined the Institute of Systems Science, National University of Singapore, as a research associate in 1991, where he presently works on projects in video/image indexing and retrieval, moving-object tracking, and image classification. His research interests include image processing, computer vision, digital video, multimedia, and remote sensing. Zhang received his PhD from the Technical University of Denmark in 1991 and his BS from Zhengzhou University, China, in 1981, both in electrical engineering.

Readers may contact Smoliar or Zhang at ISS, National University of Singapore, Heng Mui Keng Terrace, Kent Ridge, Singapore 0511, e-mail {smoliar, zhang}@iss.nus.sg.
