
User interfaces, searching and browsing

Monica Vladoiu

The DL definition reloaded..

A digital library (DL) is a focused collection of digital objects, along with methods for access and retrieval, for selection and organization, and for maintenance of the collection.
Digital objects include text, 2D or 3D graphics, animation, audio, video, simulations, dynamic visualisations, and virtual reality worlds.
The definition accords equal weight to the user (access and retrieval) and to the librarian (selection and organization, and maintenance).
2/45

What to start with?


Basically, the DL will contain the hypermedia (HM) documents and the metadata.
The HM documents are structured objects, which enhances the searching and browsing facilities.
Searching and browsing are not really so different in practice.
Users interact with information collections in many ways, from searching for particular words or phrases to browsing for needed information.
3/45

Issues to be addressed
What form are the documents in?
What structure do they have?
How do we want them to look?

4/45

Humanity Development Library (1)


The Humanity Development Library (HDL) is a large collection of practical information aimed at helping reduce poverty, increasing human potential, and providing a practical and useful education for all.
The current version, 2.0, contains 1,230 publications (books, reports, and magazines) in various areas of human development, from agricultural practice to economic policies, from water and sanitation to society and culture, from education to manufacturing, and from disaster mitigation to micro-enterprises.
It contains a total of 160,000 pages and 30,000 images, which if printed would weigh 340 kg and cost US$20,000.
It is available on CD-ROM at US$2 for distribution in developing countries.
The objective of the Humanity Libraries Project is to provide all those involved in development, well-being, and basic needs with access to a complete library of around 3,000 multidisciplinary books containing practical know-how and ideas.

see: http://nzdl.sadl.uleth.ca/cgi-bin/library?a=p&p=about&c=hdl http://nzdl.sadl.uleth.ca/cgi-bin/library?a=p&p=home&l=en&w=utf-8

5/45

Humanity Development Library (2)


All books in this DL have front-cover images, and the appropriate image always appears at the top of any page where a book, or part of a book, is displayed.
This ever-present picture gives a feeling of physical presence and is a constant reminder of the context in which the user is reading.
The user interface look and feel may be a poor substitute for the real look and feel of a physical book, but it is a lot better than nothing.
The books in the HDL are structured in sections and subsections.
The user can see the full table of contents, with all chapters and their sections and subsections included, or the text of the whole book; there are options for "Expand contents" and "Expand text".
The "Detach" button duplicates the current window on the screen, which is very useful when comparing multiple documents.

see: http://nzdl.sadl.uleth.ca/cgi-bin/library?e=d-00000-00---off-0hdl--00-0--0-10-0---0---0prompt-10---4-----dte--0-1l--11-en-50---20-about-digital+library--00-0-1-00-0-0-11-1-0utfZz-800&cl=search&d=HASHaf4aa0b02748c2f60a8236&gc=1

6/45

Humanity Development Library (3)


The DL material has been carefully selected and put together by a dedicated collection editor, who acquired the books; arranged for permission to include each one; organized a massive OCR operation to convert them into electronic form; set and monitored quality control standards for the conversion; decided what form the DL should take and what searching and browsing options should be provided; entered the metadata necessary to build these structures; and checked the integrity of the information and the look and feel of the final product.
The care and attention put into this is reflected in its high quality.
Nevertheless, it is not perfect: there are small OCR errors, and some of the in-text figures are inappropriately sized.
The amount of effort required to edit a high-quality, large collection is huge.


7/45

Alice in Wonderland by Lewis Carroll


This document has been treated as unstructured text: there is no hierarchical structure here, at least none that is known to the DL system.
Neither are there front-cover images, only a simple display of the title of the book and a page selector that lets us turn from one page to another.
Browsing is less convenient because there is less structure to work with.
Even the pages do not correspond to physical pages; they are arbitrary breaks made by the computer every few hundred lines, to prevent the web browser from downloading the entire book every time it is viewed.
The book does, in fact, have structure (chapters etc.), but this is not used, and the book is treated as a long scroll of plain text.
The book is stored as raw ASCII text, with the end of each line hard-coded in the document, rather than as HTML; that is why the lines of text are quite short and cannot expand to fill the browser window.
Compared to the previous HDL collection, this is a low-quality, unattractive DL.
This can be improved, at a cost that depends on how similar the books are and how regular their structure is.
see: http://nzdl.sadl.uleth.ca/cgi-bin/library?e=d-00000-00---off-0gberg--00-0--0-10-0---0--0prompt-10---4-----dtt--0-1l--11-en-50---20-about-alice+wonderland--00-0-1-00-0-0-11-1-0utfZz-800&d=HASH012d0bdace5d9e9dafe83634&cl=search&gp=1

8/45

Project Gutenberg (1)


The previous book resides in the Gutenberg collection.
Project Gutenberg's goal is to encourage the creation and distribution of electronic text.
Although it was conceived in 1971 with the ambitious aim of a trillion electronic literature files by the year 2001, work did not begin in earnest until 1991, and the aim was scaled back to 10,000 electronic texts within 10 years.
Now there are over 20,000 free books in the Online Book Catalog.
Copyright for most of these books has expired in the United States.
see: http://www.gutenberg.org/wiki/Main_Page

9/45

Project Gutenberg (2)


Project Gutenberg (P.G.) is a grass-roots phenomenon.
Text is input by volunteers, each of whom can enter a book a year or even just one book in a lifetime.
The material to be added is the volunteers' choice: they are encouraged to choose books they like and to enter them in the manner that is most comfortable for them.
Central to the project's philosophy is representing books as plain text, with no formatting and no metadata; therefore little effort has been made to pretty up this collection.
Quality control is a serious problem.
P.G. is remarkably visionary and gives an interesting perspective on the potential role of volunteer labor in placing society's literary treasures in the public domain.

see: http://www.gutenberg.org/wiki/Main_Page

10/45

Page images (1)


One way to show a book's pages is as digitized images rather than as the text extracted from them.
From a technical point of view there is a big difference: a textual representation generally occupies about 1/20 as much storage space as a page image.
That significantly reduces the space required to store the collection and the time needed to download each page.
Still, one good reason for showing page images rather than extracted text is to avoid OCR errors.
11/45

Page images (2)


A disadvantage of showing page images is that it is hard to find search terms on the page.
It is also hard to apply special formatting, such as highlighting or underlining, to particular words or phrases.
Some collections keep both forms to combine both benefits: no OCR errors and term searching.
Example: the New Zealand Maori newspapers collection.
see: http://www.nzdl.org - Maori Niupepa

12/45

Audio and video


Including audio and video as documents in DLs is easy: web browsers suitably equipped with plug-ins can play audio or video in a large variety of formats.
Large storage space is needed.
Bandwidth and other technical factors need to be considered carefully.
To give access to the information of interest, keeping metadata about audio and video items is crucial.
Example: the New Zealand Oral History collection.
see: http://www.nzdl.org - Oral History

13/45

Music

Digital collections of music capture the popular imagination in ways that scholarly libraries never will.
The music representation is produced by an optical music recognition (OMR) program, which is similar to OCR and works with printed music (a scanned page of a music book).
Lyrics should be made available too.
MIDI (Musical Instrument Digital Interface) is the standard used by the electronic music industry.
A music DL needs two major capabilities: to convert between different formats and to locate the relevant information.
Example: the New Zealand Melody Index collection.
see: http://www.nzdl.org - Melody Index

14/45

Presenting metadata (1)


Traditional libraries manage their holdings using catalogs that contain information about every object they own.
Metadata is information in a structured format whose purpose is to describe other data objects in order to facilitate access to them.
Moreover, metadata elements are standardized so that the same type of information can be used in different systems and for different purposes.

15/45

Presenting metadata (2)


For an e-book, information about the title, authors, date, publisher, and source of the original copy is represented as metadata.
For a paper, this is extended with the title of the publication in which the article appears, the volume number, the issue number, and the page numbers.
These are standard bibliographic metadata items.
The URL of the source bibliography and the abstract can also be included.
see: http://www.nzdl.org - Computer Science Bibliographies http://liinwww.ira.uka.de/bibliography/Misc/index.html

16/45

Presenting metadata (3) - Metadata features


Metadata has many different aspects that correspond to different kinds of information about an item.
Historical features describe provenance, form, and preservation history.
Functional features describe usage, condition, and audience.
Technical features provide information that promotes interoperability between different systems.
Relational metadata covers links and citations.
Intellectual metadata describes the content or subject.

17/45

Presenting metadata (4) Library metadata


Library metadata is standardized, and there are many different standards to choose from.
For example, metadata is retrieved over the Internet from the Library of Congress information service using an information interchange standard called Z39.50, which is widely used throughout the library world.
The metadata itself is represented in a record format called MARC (MAchine-Readable Cataloging), which is also used internationally.
MARC comes in more than 20 variants that are produced for different countries.

see: http://catalog.loc.gov/ http://catalog.loc.gov/cgibin/Pwebrecon.cgi?v1=7&ti=1,7&Search%5FArg=witten%20i%2E%20h%2E&Search%5FCode=NAME%5F &CNT=25&PID=8603&SEQ=20071102144304&SID=2 http://books.google.com/

18/45

Presenting metadata (5) what it does


Metadata provides assistance with search and retrieval.
It gives information about usage in terms of authorization, copyright, and licensing.
It addresses quality issues such as authentication and rating.
It promotes system interoperability.
Metadata descriptions often grow willy-nilly, and therefore text searching prevails over searching a structured database.
New international metadata standards are under development, but that requires a lot of hard work, negotiation, and compromise; it takes years.

19/45

SEARCHING (1)
Electronic document delivery is the primary raison d'être for most digital libraries.
Conventional automated library searches are restricted to metadata.
DLs have access to the entire contents of the objects they contain; this is a great advantage.

20/45

Searching (2)
In DLs, especially those for non-scholarly users, search should satisfy the usual user needs.
More advanced search should also be possible.
As Alan Kay, a leading early proponent of the visual paradigm for HCI, said: "Simple things should be simple, complex things should be possible."
Typically, a search screen first lets the user choose the type of search (basic or advanced), the language, and the unit of search (paragraphs, sections, or documents as full text; section titles, document titles, and authors as metadata elements).
21/45

Searching (3) - types of query


In information retrieval, an important distinction is made between Boolean and ranked queries.
Both include a list of terms to be sought in a text.
In a Boolean query, terms are combined using the connectives AND, OR, and NOT; the query responses are those units of text that satisfy the stipulated condition.
In a ranked query, the list of terms is treated as a small document in its own right; units of text that are similar to it are sought and ranked in order of the degree of match.
22/45

Searching (4) - types of query


It may be more logical to view ranking as a separate operation that can be applied to any kind of query.
From this perspective, what we call a ranked query is usually an OR query, which seeks documents that contain any of the specified words, followed by a ranking operation.
AND is the most common Boolean query type.
AND means that all of the words (or lexical equivalents) in the query must occur in the answer.
Thus, if we look for digital AND library, both a document that contains "library management in the digital age" and one that contains "software library for digital signal processing" are correct answers to the query.
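The behaviour of an AND query on the two example texts can be sketched in a few lines of Python. This is only an illustration: the function name `matches_and` and the third sample text are assumptions made for the example, and matching is done on whole lowercase words without any stemming.

```python
def matches_and(query_terms, text):
    """Return True if every query term occurs as a word in the text."""
    words = set(text.lower().split())
    return all(term.lower() in words for term in query_terms)


docs = [
    "library management in the digital age",
    "software library for digital signal processing",
    "digital signal processing basics",  # hypothetical extra text without "library"
]

# Both example texts match, although neither is about digital libraries.
hits = [d for d in docs if matches_and(["digital", "library"], d)]
print(hits)
```

The AND connective only checks that the words are present; it says nothing about whether they occur next to each other or in the intended sense.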
23/45

Searching (5) recall and precision


Information retrieval (IR) systems inevitably return some answers that are not relevant.
The user must filter these out manually.
There is a difficult choice between casting a broad query, to be sure of retrieving all relevant material, albeit diluted with many irrelevant answers, and addressing a narrow one, where most retrieved documents are of interest but others slip through the net because the query is too restrictive.
A broad search that identifies virtually all the relevant documents is said to have high recall.
One in which virtually all retrieved documents are relevant has high precision.
24/45

Searching (6) recall and precision


An enduring theme in IR is the tension between recall and precision.
When casting a search, a user should formulate the query according to which of the two s/he prefers.
In typical Web searches, precision is generally more sought after than recall: there is so much out there that one probably could not handle every relevant document.
However, if the searcher is a defense counselor looking for precedents for a legal case, recall matters more, as every relevant precedent should be checked.
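As a worked illustration of the two measures, the sketch below computes precision (the fraction of retrieved documents that are relevant) and recall (the fraction of relevant documents that are retrieved) for a broad and a narrow search. The document numbers and the function name `precision_recall` are made up for the example.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


relevant = {1, 2, 3, 4}                             # hypothetical relevant documents
broad = precision_recall(range(1, 11), relevant)    # retrieves documents 1..10
narrow = precision_recall([1, 2], relevant)         # retrieves only two documents

print(broad)   # (0.4, 1.0): everything relevant is found, diluted with irrelevant answers
print(narrow)  # (1.0, 0.5): every answer is relevant, but half the relevant ones are missed
```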

25/45

Searching (7) extra terms


Another problem is that small variations of a query can lead to quite different results.
To catch all desired documents, professional librarians add extra terms to queries.
Example: (digital OR virtual OR electronic) AND (library OR (document AND collection))
Non-professionals prefer a simple list of the words of interest.
Example: digital, virtual, electronic, library, document, collection
Identifying documents relevant to a list of terms is not just a matter of converting it to a Boolean query by using AND or OR: very few or too many documents are likely to match.
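The librarian's query above maps directly onto set operations, with OR as union and AND as intersection. The sketch below assumes a tiny, invented index that gives the set of document numbers for each term; the helper `docs_with` is hypothetical.

```python
# Hypothetical index: term -> set of document numbers containing it.
index = {
    "digital": {1, 2, 5},
    "virtual": {3},
    "electronic": {4, 5},
    "library": {1, 3, 6},
    "document": {2, 4, 6},
    "collection": {2, 6},
}


def docs_with(term):
    """Documents containing the term (empty set if the term is unknown)."""
    return index.get(term, set())


# (digital OR virtual OR electronic) AND (library OR (document AND collection))
expert = (docs_with("digital") | docs_with("virtual") | docs_with("electronic")) & (
    docs_with("library") | (docs_with("document") & docs_with("collection"))
)

# The plain word list, naively converted with AND everywhere or OR everywhere.
terms = list(index)
all_and = set.intersection(*(docs_with(t) for t in terms))
any_or = set.union(*(docs_with(t) for t in terms))

print(sorted(expert))   # [1, 2, 3]
print(sorted(all_and))  # []: too few matches
print(sorted(any_or))   # [1, 2, 3, 4, 5, 6]: every document matches
```

On this toy data the naive conversions show both failure modes at once: AND over all six words matches nothing, while OR over all six matches everything.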
26/45

Searching (8) similarity


The solution is to use a ranked query, which applies some kind of artificial measure that evaluates the similarity of each document to the query.
Based on this numeric indicator, a fixed number of the closest-matching documents are returned as answers.
If the measure is good and only a few documents are returned, they will contain predominantly relevant answers, which means high precision.
If many documents are returned, most of the relevant documents will be included, which means high recall.
In practice, high recall goes with low precision and vice versa.
27/45

Searching (9) - ranking


Great effort has been invested in a quest for similarity measures and other ranking strategies that succeed in keeping both recall and precision reasonably high.
Simple techniques just count the number of query terms that appear somewhere in the document; an obvious drawback is that long documents are favored.
Many ranking techniques assign a numeric weight to each term based on its frequency in the document collection.
With such techniques, common terms receive low weight and long documents are no longer favored.
It is difficult to describe ranking, or even the idea of ranking, to casual users in a few words.
28/45
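One common family of such measures weights each term by how rare it is in the collection and normalizes by document length. The sketch below is a simplified TF-IDF-style scorer written for illustration only; it is not the measure used by any particular DL system, and the function name `rank`, the sample texts, and the exact formula are assumptions.

```python
import math
from collections import Counter


def rank(query, docs, top_k=2):
    """Return indices of the top_k best-matching documents, best first."""
    tokenized = [d.lower().split() for d in docs]
    n = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for words in tokenized for term in set(words))
    scored = []
    for i, words in enumerate(tokenized):
        tf = Counter(words)
        score = sum(
            tf[t] * math.log(n / df[t])      # rare terms weigh more, common ones less
            for t in set(query.lower().split())
            if t in tf
        )
        score /= math.sqrt(len(words))        # damp the advantage of long documents
        scored.append((score, i))
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [i for score, i in scored[:top_k] if score > 0]


docs = [
    "the digital library holds the digital collection",
    "the library is open",
    "the museum is open the whole day and the garden is open too",
]
print(rank("the digital library", docs))  # [0, 1]
```

The word "the" occurs in every document, so its weight is log(1) = 0 and the long third text gains nothing from repeating it.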

Searching (10) Boolean or ranking?


Professional IR specialists, such as librarians, want to understand exactly how their queries will be interpreted and are willing to prepare complex queries.
For most tasks they prefer Boolean queries.
These are especially appropriate if it is metadata that is being searched, particularly if professional catalogers have entered it.
Casual users prefer ranked queries, which are very suitable if full text is being searched.
They trust that the system will perform well and are willing to scroll down through the ranked list.
29/45

Searching (11) some early solutions


A compromise between Boolean and ranked queries emerged in early Internet search engines.
By default they treated queries as ranked, but gave users more precise control by letting them mark words that must appear in the text of every answer (with a preceding + sign) and words that must not (with a preceding - sign).
As the Web grew and the quest for precision began to prevail over recall, some search engines began to return only the documents that contained all of the search terms.
A generalization of these ideas is to undertake a full Boolean search and then rank the results.
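The + and - convention can be sketched as a Boolean filter followed by a ranking step. The code below is a minimal illustration, not a reconstruction of any real search engine: the function names, the sample texts, and the crude score (number of query words present) are all assumptions.

```python
def parse_query(query):
    """Split a query into required (+), excluded (-), and optional terms."""
    required, excluded, optional = [], [], []
    for token in query.lower().split():
        if token.startswith("+") and len(token) > 1:
            required.append(token[1:])
        elif token.startswith("-") and len(token) > 1:
            excluded.append(token[1:])
        else:
            optional.append(token)
    return required, excluded, optional


def search(query, docs):
    """Filter by +/- terms, then rank by how many query words each text contains."""
    required, excluded, optional = parse_query(query)
    results = []
    for i, doc in enumerate(docs):
        words = set(doc.lower().split())
        if not all(t in words for t in required):
            continue                      # a +term is missing
        if any(t in words for t in excluded):
            continue                      # a -term is present
        score = sum(t in words for t in required + optional)
        if score:
            results.append((score, i))
    results.sort(key=lambda r: (-r[0], r[1]))
    return [i for _, i in results]


docs = [
    "digital library software",
    "digital music library",
    "digital signal processing",
    "public library hours",
]
print(search("+digital library -music", docs))  # [0, 2]
```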
30/45

Searching (12) constraints


It is difficult to design querying methods that scale up satisfactorily to hundreds of millions of documents, particularly given that queries must be answered almost immediately.
Long Boolean expressions are hard to enter, manage, and refine, which tips the balance toward automatic ranking methods to assist users in their quest for satisfactory means of information retrieval.
see: http://www.google.com/advanced_search?hl=en

31/45

Searching (13) case-folding and stemming


Querying needs two operations: case-folding and stemming.
Case-folding is replacing all uppercase characters in the query with their lowercase equivalents; if case differences are ignored, "Digital", "digital", and "DIGITAL" are the same.
Stemming is reducing a word to its neutral stem, for example "libraries" and "library" to "librar".
It relaxes the match between query terms and words in the documents so that, e.g., "libraries" is accepted as equivalent to "library".
Case-folding and stemming are language dependent.
see: http://nzdl.sadl.uleth.ca/cgi-bin/library?e=q-00000-00---off-0gberg--00-0--0-10-0---0--0prompt-10---4-----dtt--0-1l--11-en-50---20-about-digital--00-0-1-00-0-0-11-1-0utfZz-800&a=p&p=preferences
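The two operations can be sketched as follows. The suffix list in `naive_stem` is a deliberately crude assumption for English, chosen only so that the example from the slide works; real stemmers use much more careful, language-specific rules.

```python
def case_fold(text):
    """Replace uppercase characters with their lowercase equivalents."""
    return text.lower()


def naive_stem(word):
    """Strip one common English suffix; a toy rule set, not a real stemmer."""
    for suffix in ("ies", "es", "s", "y"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def normalize(term):
    """Case-fold, then stem, so that variants of a word compare equal."""
    return naive_stem(case_fold(term))


print(normalize("Libraries"), normalize("library"))               # librar librar
print(normalize("Digital"), normalize("DIGITAL"), normalize("digital"))
```

The same `normalize` step would be applied both to the words of the documents when they are indexed and to the terms of each query.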

32/45

Searching (14) phrase searching


Users often want to specify that the search is for contiguous groups of words (phrases).
This is indicated in a query by putting the phrase in quotation marks (e.g. "digital libraries").
Phrases complicate ranked searching: the frequency of the phrase, rather than the frequency of the words in it, determines its influence.
From a user's point of view, phrase searching is a simple and natural extension of the idea of searching.
Users imagine the computer looking through all the documents, as a human would, but a lot faster.
33/45

Searching (15) word searching


When computers search, they do not scan through the text as a person would, because that would take too long.
Instead, computers first create an index that records, for each word, the documents that contain that word.
Then every word in the query is looked up in the index to get a list of document numbers.
Finally, the query is answered by manipulating these lists, for example, in a Boolean AND query, by checking which documents are in all the lists.
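The three steps (build the index, look up each query word, combine the lists) can be sketched as below. The function names and sample texts are assumptions for illustration, and words are matched after simple lowercasing only.

```python
from collections import defaultdict


def build_index(docs):
    """Map each word to the set of document numbers that contain it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def boolean_and(index, terms):
    """Documents that appear in the list of every query term."""
    lists = [index.get(t.lower(), set()) for t in terms]
    if not lists:
        return []
    return sorted(set.intersection(*lists))


docs = [
    "digital library systems",
    "public library",
    "digital signal processing",
    "the digital library of the future",
]
index = build_index(docs)
print(sorted(index["library"]))                   # [0, 1, 3]
print(boolean_and(index, ["digital", "library"]))  # [0, 3]
```

The index is built once, so answering a query costs a few lookups and a set intersection rather than a scan of every text.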

34/45

Searching (16) phrase searching


Phrase searching changes everything: queries can no longer be answered simply by manipulating the lists of document numbers.
There are two quite different ways to proceed:
Postretrieval scan (prs): look inside all documents that contain the query terms and check whether the terms occur together as a phrase.
Word-level index (wli): record in the index both document numbers and the word numbers that give each word's position in the document; then, if two words are numbered consecutively within a document, they form a phrase.

see: http://nzdl.sadl.uleth.ca/cgi-bin/library?e=q-00000-00---off-0gberg--00-0--0-10-0---0--0prompt-10---4-----dtt--0-1l--11-en-50---20-about-digital--00-0-1-00-0-0-11-1-0utfZz-800&a=p&p=preferences
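The word-level approach can be sketched with an index that stores, for each word and document, the list of word positions. This is an illustrative data layout, not the one used by any particular system; the function names and sample texts are assumptions.

```python
from collections import defaultdict


def build_word_level_index(docs):
    """Map word -> document number -> list of word positions in that document."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in enumerate(docs):
        for pos, word in enumerate(text.lower().split()):
            index[word][doc_id].append(pos)
    return index


def phrase_search(index, phrase):
    """Documents in which the words of the phrase occur at consecutive positions."""
    words = phrase.lower().split()
    if not words or any(w not in index for w in words):
        return []
    hits = []
    for doc_id, starts in index[words[0]].items():
        for start in starts:
            # Word k of the phrase must sit exactly k positions after the first word.
            if all(
                start + k in index[w].get(doc_id, ())
                for k, w in enumerate(words[1:], 1)
            ):
                hits.append(doc_id)
                break
    return sorted(hits)


docs = [
    "digital libraries are fun",
    "libraries in the digital age",
    "the digital libraries project",
]
index = build_word_level_index(docs)
print(phrase_search(index, "digital libraries"))  # [0, 2]
```

The second text contains both words, so it would satisfy an AND query, but the words are not consecutive and the phrase search rejects it.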

35/45

Searching (17) phrase searching efficiency


The mechanism used for phrase searching greatly affects the resources required by an IR system and its performance.
With prs: only a document-level index is needed, but it takes a lot of time to respond because many documents might have to be scanned, especially for common words in the query.
With wli: the index is significantly larger, but response time is much shorter; punctuation and white space can be indexed as well.
36/45

Searching (18) phr. search: Practical decisions


A word-level index will be included if it is feasible, if phrase searching is likely to be common, and if the space occupied by the system is not a significant constraint.
In simple systems a postretrieval scan will be used, particularly if phrase searching will be rare.
In either case, ranking will be based only on individual word frequencies, for practical reasons.

37/45

Searching (19) Query interface requirements


Different query interfaces are suitable for different tasks.
Studies have shown that the most common number of terms in actual queries to actual Web search systems is one or two.
Modern search engines can deal with large queries.
A useful feature for all kinds of search is to allow users to examine and reuse their search history.
Often, searches on different fields need to be combined: one might look for a book by a certain author, with a particular word in the title, on a given subject, or with a particular phrase in the main text, and so on.
see: http://ask.bibsys.no/ask/action/stdsearch http://www.gutenberg.org/catalog/world/search

38/45

SEARCHING vs. BROWSING


Searching and browsing are two sides of the same coin, but there is an entire spectrum between the two.
To search is "to look into or over carefully or thoroughly in an effort to find or discover something".
To browse is "to look over or through an aggregate of things casually especially in search of something of interest".
Searching is purposeful, and browsing tends to be casual.
Searching implies the user knows what s/he is looking for; browsing is like "I'll know it when I see it".

see: http://m-w.com/dictionary/search

39/45

Browsing ordered lists


The metadata provided with the documents in a collection offers handles for different kinds of browsing activities.
Information collections that are entirely devoid of metadata can be searched, but they cannot be browsed: the metadata structure is the key to the ability to browse.
The simplest and most rudimentary browsing structure is the ordered list.
Ordering can be alphabetical by title or author, by date, etc.
Some writing systems do not follow Latin alphabetical order (Chinese, Arabic); Chinese characters, for example, can be ordered according to the number of strokes they contain.
see: http://books.google.com/books?q=subject:%22+Literature+%22&as_brr=3
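An ordered browsing list is essentially the collection sorted on one metadata field. The sketch below uses three invented records (titles, authors, and years are hypothetical) and a hypothetical helper `browse_list`.

```python
# Hypothetical metadata records; none of these describe real publications.
records = [
    {"title": "water supply handbook", "author": "Smith, Ann", "year": 1994},
    {"title": "Agricultural Practice", "author": "Jones, Bo", "year": 1998},
    {"title": "Micro-enterprises", "author": "Adams, Cy", "year": 1991},
]


def browse_list(records, field):
    """Return the titles ordered by the given metadata field."""
    def key(record):
        value = record[field]
        # Compare text without regard to case; numbers are compared as they are.
        return value.casefold() if isinstance(value, str) else value

    return [r["title"] for r in sorted(records, key=key)]


print(browse_list(records, "title"))
print(browse_list(records, "author"))
print(browse_list(records, "year"))
```

Case-insensitive comparison is adequate for plain English titles; other languages and scripts need collation rules of their own, such as stroke order for Chinese.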

40/45

Browsing hierarchical classification structures


Linear browsing works if the number of documents is small.
Hierarchical classification structures are the standard tools if the number of objects is significant.
In the library world, the Library of Congress Classification and the Dewey Decimal Classification (DDC) are used to arrange printed books in categories.
The goal is to place volumes treating the same or similar subjects next to each other on the library shelves.
These schemes are hierarchical: the early parts of the code provide a rough categorization that is refined by the later characters.
E.g. in the DDC: 330 for economics + 9 for geographic treatment + 4 for Europe = 330.94, European economy.
see: http://en.wikipedia.org/wiki/Dewey_Decimal_Classification
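Because each extra character refines the category, the path from the broadest class down to a given code can be read off its prefixes. The sketch below is a simplified illustration of that idea for Dewey-style codes; the function name `ddc_path` is an assumption, and real DDC numbers have further conventions that are ignored here.

```python
def ddc_path(code):
    """List the successively narrower classes that lead to a Dewey-style code."""
    digits = code.replace(".", "")
    path = []
    for n in range(1, len(digits) + 1):
        prefix = digits[:n]
        if n < 3:
            prefix = prefix.ljust(3, "0")          # "3" -> "300", "33" -> "330"
        elif n > 3:
            prefix = prefix[:3] + "." + prefix[3:]  # "3309" -> "330.9"
        if not path or path[-1] != prefix:          # skip repeats such as 330, 330
            path.append(prefix)
    return path


# 330 (economics) -> 330.9 (geographic treatment) -> 330.94 (Europe)
print(ddc_path("330.94"))  # ['300', '330', '330.9', '330.94']
```

A hierarchical browser can use such paths to group every document whose code starts with the same prefix under one node.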

41/45

Browsing classification scheme


The classification scheme of a DL could be standard or nonstandard; it is up to the collection editor to choose the most appropriate scheme for the collection's users.
Developers of DL systems have to decide whether to try to impose uniformity on the people who build collections or instead to give them the flexibility to organize things in the way they see fit.
Opting for the latter gives librarians the freedom to exercise their professional judgment effectively.

42/45

Phrase browsing
People want to browse information collections based on their subject matter.
That kind of browsing is well supported by displays based on hierarchical classification metadata that is associated with each document.
But manual classification is expensive and tedious for large document collections.
To address this issue, one can build topical browsing interfaces based on phrase metadata, where the phrases have been extracted automatically from the full text of the documents themselves.
Representative key phrases can be chosen.
43/45

Browsing using extracted metadata


The browsing methods rely on metadata that must be provided for all documents in the collection.
This can be automatically extracted from the documents' full text.
Titles may be identified by seeking capitalized text close to the beginning of documents.
Names may be identified by looking for the capitalization and punctuation patterns that characterize forms such as "Surname, Forename" and "Forename Initial. Surname".
Other data could also be identified based on data patterns.
That metadata can then be indexed and used.
There will be some residual errors, but the mechanism is still very useful.
44/45
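The pattern-based extraction described above can be sketched with regular expressions. The patterns, function names, and sample text below are assumptions made for illustration; they cover only the two name forms mentioned and treat an all-uppercase line near the top as the title.

```python
import re

# "Surname, Forename" and "Forename Initial. Surname" (illustrative patterns only).
SURNAME_FIRST = re.compile(r"\b([A-Z][a-z]+), ([A-Z][a-z]+)\b")
INITIAL_FORM = re.compile(r"\b([A-Z][a-z]+) ([A-Z])\. ([A-Z][a-z]+)\b")


def extract_names(text):
    """Return names found in either form, written as 'Forename Surname'."""
    names = [f"{m.group(2)} {m.group(1)}" for m in SURNAME_FIRST.finditer(text)]
    names += [
        f"{m.group(1)} {m.group(2)}. {m.group(3)}" for m in INITIAL_FORM.finditer(text)
    ]
    return names


def guess_title(text, max_lines=5):
    """Return the first all-uppercase line near the beginning, if any."""
    for line in text.splitlines()[:max_lines]:
        stripped = line.strip()
        if len(stripped) > 3 and stripped.isupper():
            return stripped
    return None


# Hypothetical document opening; the title and people are invented.
sample = "A GUIDE TO WATER PUMPS\n\nwritten by Smith, Ann\nreviewed by John Q. Public\n"
print(guess_title(sample))    # A GUIDE TO WATER PUMPS
print(extract_names(sample))  # ['Ann Smith', 'John Q. Public']
```

Patterns this simple also produce the residual errors mentioned above: a place written as "Paris, France", for instance, would be misread as a name.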

The DL definition reloaded..


A digital library (DL) is a focused collection of digital objects, along with methods for access and retrieval, for selection and organization, and for maintenance of the collection.
The definition accords equal weight to the user (access and retrieval) and to the librarian (selection and organization, and maintenance).
45/45
