SCSI V User Interfaces Searching and Browsing

User interfaces,
searching and browsing
Monica Vladoiu
The DL definition reloaded..
a DL is a focused collection of digital objects, along with methods for access and retrieval, for selection and organization, and for maintenance of the collection
digital objects include text, 2D or 3D graphics, animation, audio, video, simulations, dynamic visualisations, and virtual reality worlds
the definition accords equal weight to the user (access and retrieval) and to the librarian (organization and selection, and maintenance)
2/45
What to start with?

basically, the DL will contain the hypermedia (HM) documents and the metadata
the HM docs are structured objects, which enhances the searching and browsing facilities
searching and browsing are not really so different in practice
users interact with information collections in many ways, from searching for particular words or phrases to browsing for needed information
3/45
Issues to be addressed
what form are the documents in?
what structure do they have?

how do we want them to look?
4/45
Humanity Development Library (1)

a large collection of practical information aimed at helping reduce poverty, increasing human potential, and providing a practical and useful education for all
the current version, 2.0, contains 1,230 publications (books, reports, and magazines) in various areas of human development, from agricultural practice to economic policies, from water and sanitation to society and culture, from education to manufacturing, from disaster mitigation to micro-enterprises
it contains a total of 160,000 pages and 30,000 images, which if printed would weigh 340 kg and cost US$20,000
it is available on CD-ROM at US$2 for distribution in developing countries
the objective of the Humanity Libraries Project is to provide all those involved in development, well-being and basic needs with access to a complete library of around 3,000 multidisciplinary books containing practical know-how and ideas
see: http://nzdl.sadl.uleth.ca/cgi-bin/library?a=p&p=about&c=hdl http://nzdl.sadl.uleth.ca/cgi-bin/library?a=p&p=home&l=en&w=utf-8
5/45

Humanity Development Library (2)

all books in this DL have front-cover images, and the appropriate image always appears at the top of any page where a book, or part of a book, is displayed
this ever-present picture gives a feeling of physical presence and is a constant reminder of the context in which the user is reading
the look and feel of the user interface may be a poor substitute for the real look and feel of a physical book, but it's a lot better than nothing
the books in HDL are structured in sections and subsections
the user can see the full table of contents, with all chapters and their sections and subsections included, or the text of the whole book
there are options for Expand contents and Expand text
the Detach button duplicates the current window on the screen, which is very useful when comparing multiple documents
see: http://nzdl.sadl.uleth.ca/cgi-bin/library?e=d-00000-00---off-0hdl--00-0--0-10-0---0---0prompt-10---4-----dte--0-1l--11-en-50---20-about-digital+library--00-0-1-00-0-0-11-1-0utfZz-800&cl=search&d=HASHaf4aa0b02748c2f60a8236&gc=1
6/45

Humanity Development Library (3)

the DL material has been carefully selected and put together by a dedicated collection editor, who:
acquired the books
arranged for permission to include each one
organized a massive OCR operation to convert them into electronic form
set and monitored quality control standards for the conversion
decided what form the DL should take and what searching and browsing options should be provided
entered the metadata necessary to build these structures
checked the integrity of the information and the look and feel of the final product
the care and attention put into this is reflected in its high quality
nevertheless, it's not perfect: there are small OCR errors, and some of the in-text figures are inappropriately sized
the amount of effort required to edit a high-quality, large collection is huge

7/45
Alice in Wonderland by Lewis Carroll

this document has been treated as unstructured text: there is no hierarchical structure here, at least none that is known to the DL system
there are no front-cover images either, only a simple display of the title of the book and a page selector that lets us turn from one page to another
browsing is less convenient because there is less structure to work with
even the pages don't correspond to physical pages; they are arbitrary breaks made by the computer every few hundred lines, to prevent the web browser from downloading the entire book every time it's viewed
the book does, in fact, have structure (chapters etc.), but this is not used and the book is treated as a long scroll of plain text
the book is stored as raw ASCII text rather than HTML, with the end of each line hard-coded in the document
that's why the lines of text are quite short: they cannot expand to fill the browser window
compared to the previous HDL collection, this is a low-quality, unattractive DL
this can be improved, at a cost that depends on how similar the books are and how regular their structure is
see: http://nzdl.sadl.uleth.ca/cgi-bin/library?e=d-00000-00---off-0gberg--00-0--0-10-0---0--0prompt-10---4-----dtt--0-1l--11-en-50---20-about-alice+wonderland--00-0-1-00-0-0-11-1-0utfZz-800&d=HASH012d0bdace5d9e9dafe83634&cl=search&gp=1
8/45
Project Gutenberg (1)

the previous book resides in the Gutenberg collection
Project Gutenberg's goal is to encourage the creation and distribution of electronic text
although conceived in 1971 with the ambitious aim of a trillion electronic literature files by the year 2001, work did not begin in earnest until 1991, and the aim was scaled back to 10,000 electronic texts within 10 years
now there are over 20,000 free books in the Online Book Catalog
copyright for most of these books has expired in the United States
see: http://www.gutenberg.org/wiki/Main_Page
9/45
Project Gutenberg (2)

P.G. is a grass-roots phenomenon
text is input by volunteers, each of whom can enter a book a year or even just one book in a lifetime
the material to be added is the volunteers' choice: they are encouraged to choose books they like and to enter them in the manner that is most comfortable for them
central to the project's philosophy is to represent books as plain text, with no formatting and no metadata; therefore little effort has been made to pretty up this collection
quality control is a serious problem
P.G. is remarkably visionary and gives an interesting perspective on the potential role of volunteer labor in placing society's literary treasures in the public domain
see: http://www.gutenberg.org/wiki/Main_Page
10/45
Page images (1)

one way to show a book's pages is as digitized images rather than as the text extracted from them
from a technical point of view there is a big difference: a textual representation generally occupies about 1/20 as much storage space as a page image
that significantly reduces the space required to store the collection and the time needed to download each page
still, one good reason for showing page images rather than extracted text is to avoid OCR errors
11/45
Page images (2)

a disadvantage of showing page images is that it is hard to find search terms on the page
it is also hard to apply special formatting, such as highlighting or underlining, to particular words or phrases
some collections keep both forms to combine their benefits: no OCR errors in what is displayed, and term searching
example: the New Zealand Maori newspapers collection
see: http://www.nzdl.org - Maori Niupepa
12/45
Audio and video

including audio and video as documents in DLs is easy
web browsers, suitably equipped with plug-ins, can play audio or video in a large variety of formats
large storage space is needed
bandwidth and other technical factors need to be considered carefully
to get access to the information of interest, keeping metadata about audio and video items is crucial
example: the New Zealand Oral History collection
see: http://www.nzdl.org - Oral History
13/45
Music
digital collections of music capture the popular imagination in ways that scholarly libraries never will
the music representation is produced by an OMR (optical music recognition) program, which is similar to OCR and works with printed music (a scanned page of a music book)
lyrics should be made available too
MIDI (musical instrument digital interface) is the standard used by the electronic music industry
a music DL needs 2 major capabilities: converting between different formats and locating the relevant information
example: the New Zealand Melody Index collection
see: http://www.nzdl.org - Melody Index
14/45
Presenting metadata (1)

traditional libraries manage their holdings using catalogs that contain information about every object they own
metadata is information in a structured format, whose purpose is to describe other data objects in order to facilitate access to them
moreover, metadata elements are standardized so that the same type of information can be used in different systems and for different purposes
15/45
Presenting metadata (2)

for an e-book, information about the title, authors, date, publisher, and source of the original copy is represented as metadata
for a paper, that is expanded with the title of the publication in which the article appears, the volume number, the issue number and the page numbers
these are standard bibliographic metadata items
the URL of the source bibliography and the abstract can also be included
see: http://www.nzdl.org - Computer Science Bibliographies http://liinwww.ira.uka.de/bibliography/Misc/index.html
16/45
Presenting metadata (3) - Metadata features

metadata has many different aspects that correspond to different kinds of information about an item
historical features describe provenance, form, and preservation history
functional features describe usage, condition, and audience
technical features provide information that promotes interoperability between different systems
relational metadata covers links and citations
intellectual metadata describes the content or subject
17/45
Presenting metadata (4) Library metadata

library metadata is standardized, and there are many different standards to choose from
for example, metadata is retrieved over the Internet from the Library of Congress information service using an information interchange standard called Z39.50, which is widely used throughout the library world
the metadata is represented in a record format called MARC (MAchine-Readable Cataloging), which is also used internationally
MARC comes in more than 20 variants that are produced for different countries
see: http://catalog.loc.gov/ http://catalog.loc.gov/cgibin/Pwebrecon.cgi?v1=7&ti=1,7&Search%5FArg=witten%20i%2E%20h%2E&Search%5FCode=NAME%5F &CNT=25&PID=8603&SEQ=20071102144304&SID=2 http://books.google.com/
18/45
Presenting metadata (5) what it does

provides assistance with search and retrieval
gives information about usage in terms of authorization, copyright and licensing
addresses quality issues such as authentication and rating
promotes system interoperability
metadata descriptions often grow willy-nilly, and therefore text searching prevails over searching a structured database
new international metadata standards are under development, but that requires a lot of hard work, negotiation, and compromise; it takes years
19/45
SEARCHING (1)
electronic document delivery is the first raison d'être for most digital libraries
conventional automated library searches are restricted to metadata
DLs have access to the entire contents of the objects they contain
this is a great advantage
20/45
Searching (2)
in DLs, especially those for non-scholarly users, search should satisfy the usual user needs
more advanced search should also be possible
as Alan Kay, a leading early proponent of the visual paradigm for HCI, said: "simple things should be simple, complex things should be possible"
typically a search screen first lets the user choose:
the type of search (basic, advanced)
the language
the unit of search (paragraphs, sections, or documents as full text; section titles, document titles, and authors as metadata elements)
21/45
Searching (3) - types of query

in Information Retrieval, an important distinction is made between Boolean and ranked queries
both include a list of terms to be sought in a text
in a Boolean query, terms are combined using the connectives AND, OR, and NOT
the query responses are those units of text that satisfy the stipulated condition
in a ranked query, the list of terms is treated as a small document in its own right
units of text that are similar to it are sought and ranked in order of the degree of match
22/45
Searching (4) - types of query

it may be more logical to view ranking as a separate operation that can be applied to any kind of query
from this perspective, what we call a ranked query is usually an OR query, which seeks docs that contain any of the specified words, followed by a ranking operation
AND is the most common Boolean query type
AND means that all of the words (or lexical equivalents) in the query must occur in the answer
thus if we look for "digital AND library", both the docs that contain "library management in the digital age" and those that contain "software library for digital signal processing" are correct answers to the query
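The difference between an AND query and an OR query followed by ranking can be sketched in a few lines of Python. This is only an illustration: the function names, the third sample document, and the ranking rule (number of distinct query terms matched) are assumptions, not part of the slides.

```python
def tokens(text):
    """Lowercase a text and split it into a set of words."""
    return set(text.lower().split())

def matches_and(query_terms, text):
    """Boolean AND: every query term must occur in the text."""
    return set(query_terms) <= tokens(text)

def matches_or(query_terms, text):
    """Boolean OR: at least one query term must occur in the text."""
    return bool(set(query_terms) & tokens(text))

def ranked(query_terms, texts):
    """OR query followed by ranking on the number of distinct terms matched."""
    hits = [t for t in texts if matches_or(query_terms, t)]
    return sorted(hits, key=lambda t: len(set(query_terms) & tokens(t)), reverse=True)

docs = [
    "library management in the digital age",
    "software library for digital signal processing",
    "digital signal processing basics",  # assumed extra document
]
query = ["digital", "library"]

and_hits = [d for d in docs if matches_and(query, d)]  # the first two docs
or_ranked = ranked(query, docs)  # all three docs, weakest match last
```

Both example documents from the slide satisfy the AND query, while the OR query also admits the third document but ranks it last.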
23/45
Searching (5) recall and precision

information retrieval (IR) systems inevitably return some answers that are not relevant
the user must filter these out manually
there is a difficult choice between:
casting a broad query to be sure of retrieving all relevant material, albeit diluted with many irrelevant answers
addressing a narrow one, where most retrieved docs are of interest but others slip through the net because the query is too restrictive
a broad search that identifies virtually all the relevant docs is said to have high recall
one in which virtually all retrieved docs are relevant has high precision
24/45
Searching (6) recall and precision

an enduring theme in IR is the tension between these two
when casting a search, the user should formulate the query according to which one s/he prefers
in typical Web searches precision is generally more sought after than recall
there is so much out there that one probably couldn't handle every relevant doc
however, if the searcher is a defense counselor looking for precedents for a legal case, recall matters more, as every relevant precedent should be checked
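Recall and precision are usually computed as simple ratios: precision is the fraction of retrieved docs that are relevant, and recall is the fraction of relevant docs that were retrieved. A minimal sketch follows; the document numbers are made up for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved document numbers."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

relevant = {1, 2, 3, 4}  # hypothetical relevant docs

# A broad query finds every relevant doc, diluted with irrelevant ones.
broad = precision_recall({1, 2, 3, 4, 5, 6, 7, 8}, relevant)

# A narrow query returns only relevant docs, but misses half of them.
narrow = precision_recall({1, 2}, relevant)
```

Here the broad query scores precision 0.5 and recall 1.0, and the narrow one precision 1.0 and recall 0.5, which is the trade-off described above.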
25/45
Searching (7) extra terms

another problem is that small variations of a query can lead to quite different results
to catch all desired documents, professional librarians add extra terms to queries
ex: (digital OR virtual OR electronic) AND (library OR (document AND collection))
non-professionals prefer simple lists of words of interest
ex: digital, virtual, electronic, library, document, collection
identifying docs relevant to a list of terms is not just a matter of converting it to a Boolean query by using AND/OR: with AND very few docs are likely to match, with OR too many
26/45
Searching (8) similarity

the solution is to use a ranked query, which applies some kind of artificial measure that evaluates the similarity of each document to the query
based on this numeric indicator, a fixed number of the closest matching documents are returned as answers
if the measure is good and only a few documents are returned, they will contain predominantly relevant answers, which means high precision
if many docs are returned, most of the relevant docs will be included, which means high recall
in practice, high recall goes with low precision and vice versa
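A ranked query of this kind can be sketched as follows. The slides do not name a particular similarity measure; this sketch assumes one common choice, term weights that decrease with how many documents contain the term, compared by cosine similarity (which also normalises for document length). The function name, the parameter `k` and the sample documents are illustrative.

```python
import math
from collections import Counter

def rank(query, docs, k=2):
    """Return the numbers of the k documents most similar to the query."""
    doc_terms = [Counter(d.lower().split()) for d in docs]
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for terms in doc_terms for term in terms)

    def weights(counts):
        # Terms found in few documents get a high weight; unknown terms are dropped.
        return {t: c * math.log(1 + n / df[t]) for t, c in counts.items() if t in df}

    def cosine(a, b):
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        norm = math.sqrt(sum(w * w for w in a.values())) * \
               math.sqrt(sum(w * w for w in b.values()))
        return dot / norm if norm else 0.0

    q = weights(Counter(query.lower().split()))
    scores = [(cosine(q, weights(terms)), i) for i, terms in enumerate(doc_terms)]
    scores.sort(key=lambda s: (-s[0], s[1]))
    # Keep a fixed number of the closest matches, dropping docs with no overlap.
    return [i for score, i in scores[:k] if score > 0]

docs = [
    "the digital library collection",
    "the library of the town",
    "the signal processing toolbox",
]
top = rank("digital library", docs)
```

A small `k` favours precision and a large `k` favours recall, mirroring the trade-off on the slide.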
27/45
Searching (9) - ranking

great effort has been invested in a quest for similarity measures and other ranking strategies that succeed in keeping both recall and precision reasonably high
simple techniques just count the number of query terms that appear somewhere in the document
an obvious drawback is that long docs are favored
many ranking techniques assign a numeric weight to each term based on its frequency in the document collection:
common terms receive low weight
long docs are no longer favored
it is difficult to explain ranking, or even the idea behind it, to casual users in a few words
28/45
Searching (10) Boolean or ranking?

professional IR specialists such as librarians want to understand exactly how their queries will be interpreted and are willing to prepare complex queries
for most tasks they prefer Boolean queries
these are especially appropriate if it is metadata that is being searched, and particularly if professional catalogers have entered it
casual users prefer ranked queries, which are very suitable if full text is being searched
they trust that the system will perform well and are willing to scroll down through the ranked list
29/45
Searching (11) some early solutions

a compromise between Boolean and ranked queries emerged in early Internet search engines
by default they treated queries as ranked, but allowed users more precise control by indicating certain words that must appear in the text of every answer (preceded by a + sign) and others that must not (preceded by a - sign)
as the Web grew and the quest for precision began to prevail over recall, some search engines began to return only the docs that contained all of the search terms
a generalization of these ideas is to undertake a full Boolean search and to rank the results
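The + and - convention can be sketched as a Boolean filter followed by a ranking step. The parsing rules, the scoring (number of matched query terms) and the sample documents are assumptions made for illustration, not a description of any particular search engine.

```python
def parse(query):
    """Split a query into required (+), excluded (-) and optional terms."""
    required, excluded, optional = set(), set(), set()
    for word in query.lower().split():
        if word.startswith("+") and len(word) > 1:
            required.add(word[1:])
        elif word.startswith("-") and len(word) > 1:
            excluded.add(word[1:])
        else:
            optional.add(word)
    return required, excluded, optional

def search(query, docs):
    """Filter on the +/- terms, then rank by the number of matched terms."""
    required, excluded, optional = parse(query)
    results = []
    for i, text in enumerate(docs):
        words = set(text.lower().split())
        if not required <= words or excluded & words:
            continue  # fails the Boolean part of the query
        score = len((required | optional) & words)
        if score:
            results.append((score, i))
    results.sort(key=lambda r: (-r[0], r[1]))
    return [i for _, i in results]

docs = [
    "digital library software",
    "digital signal processing library",
    "public library opening hours",
    "digital signal processing",
]
controlled = search("+library digital -signal", docs)
plain = search("digital library", docs)
```

With the operators, only docs 0 and 2 survive the filter; without them, every doc that shares a term with the query is returned in ranked order.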
30/45
Searching (12) constraints

it's difficult to design querying methods that scale up satisfactorily to hundreds of millions of documents, particularly given that queries must be answered almost immediately
long Boolean expressions are hard to enter, manage, and refine, which tips the balance toward automatic ranking methods that assist users in their quest for satisfactory means of information retrieval
see: http://www.google.com/advanced_search?hl=en
31/45
Searching (13) case-folding and stemming

querying needs 2 operations: case-folding and stemming
if case differences are ignored, "Digital", "digital" and "DIGITAL" are the same
case-folding replaces all uppercase characters in the query with their lowercase equivalents
stemming relaxes the match between query terms and words in the documents so that, e.g., "libraries" is accepted as equivalent to "library"
stemming reduces a word to its neutral stem, for example "libraries" and "library" to "librar"
case-folding and stemming are language dependent
see: http://nzdl.sadl.uleth.ca/cgi-bin/library?e=q-00000-00---off-0gberg--00-0--0-10-0---0--0prompt-10---4-----dtt--0-1l--11-en-50---20-about-digital--00-0-1-00-0-0-11-1-0utfZz-800&a=p&p=preferences
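Both operations can be illustrated with a toy sketch. The suffix list below is a deliberately crude, English-only assumption chosen so that "libraries" and "library" both reduce to "librar"; real stemmers use much more elaborate rules, which is one reason stemming is language dependent.

```python
# Toy English suffix list (an assumption for illustration only).
SUFFIXES = ("ies", "es", "s", "y")

def case_fold(word):
    """Replace uppercase characters with their lowercase equivalents."""
    return word.lower()

def stem(word):
    """Strip the first matching suffix, keeping a stem of at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def normalise(text):
    """Case-fold and stem every word of a query or document."""
    return [stem(case_fold(w)) for w in text.split()]

folded = normalise("Digital DIGITAL digital")
stems = normalise("Libraries library")
```

After normalisation, the three spellings of "digital" are identical and both forms of "library" share the stem "librar", so a query using one form matches documents using the other.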
32/45
Searching (14) phrase searching

users often want to specify that the search is for contiguous groups of words (phrases)
this is indicated in a query by putting the phrase in quotation marks (e.g. "digital libraries")
phrases complicate ranked searching: the frequency of the phrase, rather than the frequencies of the words in it, should determine its influence
from a user's point of view, phrase searching is a simple and natural extension to the idea of searching
users imagine the computer is looking through all the docs, as a human would do, but a lot faster
33/45
Searching (15) word searching

when computers search, they don't scan through the text as a person would, because that would take too long
instead, computers first create an index that records, for each word, the documents that contain that word
then every word in the query is looked up in the index to get a list of document numbers
finally, the query is answered by manipulating these lists, for example, in a Boolean AND query, by checking which docs are in all the lists
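The index-then-intersect procedure can be sketched as below. This is a minimal in-memory illustration with assumed function names and sample documents; production systems store compressed lists on disk.

```python
from collections import defaultdict

def build_index(docs):
    """Map each word to the set of document numbers that contain it."""
    index = defaultdict(set)
    for doc_no, text in enumerate(docs):
        for word in text.lower().split():
            index[word].add(doc_no)
    return index

def and_query(index, terms):
    """Answer a Boolean AND query by intersecting the lists of document numbers."""
    lists = [index.get(t.lower(), set()) for t in terms]
    return sorted(set.intersection(*lists)) if lists else []

docs = [
    "library management in the digital age",
    "software library for digital signal processing",
    "digital signal processing basics",
]
index = build_index(docs)
both = and_query(index, ["digital", "library"])
```

The document texts are read only once, when the index is built; answering the query touches nothing but the lists for "digital" and "library".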
34/45
Searching (16) phrase searching

phrase searching changes everything
queries can no longer be answered simply by manipulating the lists of document numbers
there are 2 quite different ways to proceed:
postretrieval scan (prs): look inside all docs that contain the query terms and check whether the terms occur together as a phrase
word-level index (wli): record in the index not only document numbers but also the position (word number) of each word within the doc; then if two words are numbered consecutively within a doc, they form a phrase
see: http://nzdl.sadl.uleth.ca/cgi-bin/library?e=q-00000-00---off-0gberg--00-0--0-10-0---0--0prompt-10---4-----dtt--0-1l--11-en-50---20-about-digital--00-0-1-00-0-0-11-1-0utfZz-800&a=p&p=preferences
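The two approaches can be compared in a short sketch. The data structures and function names are assumptions made for illustration: the word-level index stores word positions per document, while the postretrieval scan re-reads the text of each candidate document.

```python
from collections import defaultdict

def build_word_level_index(docs):
    """Map each word to {document number: [word positions]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_no, text in enumerate(docs):
        for pos, word in enumerate(text.lower().split()):
            index[word][doc_no].append(pos)
    return index

def phrase_query(index, phrase):
    """Find docs in which the phrase words occur at consecutive positions."""
    words = phrase.lower().split()
    if not words or any(w not in index for w in words):
        return []
    hits = []
    for doc_no, starts in index[words[0]].items():
        for start in starts:
            if all(start + k in index[w].get(doc_no, []) for k, w in enumerate(words)):
                hits.append(doc_no)
                break
    return sorted(hits)

def postretrieval_scan(docs, candidates, phrase):
    """Scan the text of each candidate doc for the phrase."""
    words = phrase.lower().split()
    hits = []
    for doc_no in candidates:
        text = docs[doc_no].lower().split()
        if any(text[i:i + len(words)] == words
               for i in range(len(text) - len(words) + 1)):
            hits.append(doc_no)
    return hits

docs = [
    "digital libraries are collections",
    "libraries in the digital age",
    "the digital age of digital libraries",
]
wli_hits = phrase_query(build_word_level_index(docs), "digital libraries")
prs_hits = postretrieval_scan(docs, [0, 1, 2], "digital libraries")
```

Both approaches return the same documents; they differ in where the cost falls, a larger index for the first and more text scanning at query time for the second.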
35/45
Searching (17) phrase searching efficiency

the mechanism used for phrase searching greatly affects the resources required by an IR system and its performance
with prs:
only a doc-level index is needed
it takes a lot of time to respond because many docs might have to be scanned, especially for common words in the query
with wli:
the index is significantly larger
response time is much shorter
punctuation and white spaces can be indexed as well
36/45
Searching (18) phr. search: Practical decisions

a word-level index will be included if it is feasible, if phrase searching is likely to be common, and if the space occupied by the system is not a significant constraint
in simple systems a postretrieval scan will be used, particularly if phrase searching will be rare
in either case, ranking will be based only on individual word frequencies, for practical reasons
37/45
Searching (19) Query interface requirements

different query interfaces are suitable for different tasks
studies have shown that the most common number of terms in actual queries to actual Web search systems is one or two
modern search engines can deal with large queries
a useful feature for all kinds of search is to allow users to examine and reuse their search history
often searches on different fields need to be combined
one might look for a book by a certain author, with a particular word in the title, on a given subject, with a particular phrase in the main text, and so on
see: http://ask.bibsys.no/ask/action/stdsearch http://www.gutenberg.org/catalog/world/search
38/45
SEARCHING vs. BROWSING

searching and browsing are two sides of the same coin, but there is an entire spectrum between the two
to search is "to look into or over carefully or thoroughly in an effort to find or discover something"
to browse is "to look over or through an aggregate of things casually especially in search of something of interest"
searching is purposeful, and browsing tends to be casual
searching implies the user knows what s/he is looking for
browsing is like "I'll know it when I see it"
see: http://m-w.com/dictionary/search
39/45
Browsing ordered lists

the metadata provided with the documents in a collection offers handles for different kinds of browsing activities
information collections that are entirely devoid of metadata can be searched, but they cannot be browsed
the metadata structure is the key to the ability to browse
the simplest and most rudimentary browsing structure is the ordered list
ordering can be alphabetical by title or author, by date, etc.
some languages do not use the Latin alphabet (e.g. Chinese, Arabic), so alphabetical ordering does not carry over directly
for Chinese, characters can be ordered according to the number of strokes they contain
see: http://books.google.com/books?q=subject:%22+Literature+%22&as_brr=3
40/45
Browsing hierarchical classification structures

linear browsing works if the number of docs is small
hierarchical classification structures are the standard tools if the number of objects is significant
in the library world, the Library of Congress Classification and the Dewey Decimal Classification (DDC) are used to arrange printed books in categories
the goal is to place volumes treating the same or similar subjects next to each other on the library shelves
these schemes are hierarchical: the early parts of the code provide a rough categorization that is refined by the later characters
e.g. DDC: 330 for economics + 9 for geographic treatment + 4 for Europe = 330.94 European economy
see: http://en.wikipedia.org/wiki/Dewey_Decimal_Classification
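The way a hierarchical code supports browsing can be sketched with the 330.94 example. This is a simplification: it treats the three-digit class as the top level and each further digit as one refinement, and the book titles are hypothetical.

```python
def ancestors(code):
    """List progressively longer prefixes of a code, from broad to specific."""
    digits = code.replace(".", "")
    prefixes = []
    for n in range(3, len(digits) + 1):
        prefix = digits[:n]
        prefixes.append(prefix if n == 3 else prefix[:3] + "." + prefix[3:])
    return prefixes

def shelf(books, prefix):
    """Return the titles filed under a prefix, in shelf (code) order."""
    return [title for code, title in sorted(books) if code.startswith(prefix)]

# Hypothetical catalogue entries: (classification code, title).
books = [
    ("330.94", "A survey of the European economy"),
    ("004", "Computing basics"),
    ("330", "Economics primer"),
]

levels = ancestors("330.94")        # economics, geographic treatment, Europe
economics = shelf(books, "330")     # everything under economics
europe = shelf(books, "330.94")     # only the European economy
```

Browsing down the hierarchy amounts to lengthening the prefix: each extra character narrows the shelf to a more specific subject.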
41/45
Browsing classification scheme

the classification scheme of a DL could be standard or nonstandard
it is up to the collection editor to choose the scheme most appropriate for the collection's users
developers of DL systems have to decide whether to try to impose uniformity on the people who build collections, or instead to provide flexibility for them to organize things in the way they see fit
the latter option gives librarians the freedom to exercise their professional judgment effectively
42/45
Phrase browsing
people want to browse information collections based on their subject matter
that kind of browsing is well supported by displays based on hierarchical classification metadata that is associated with each document
but manual classification is expensive and tedious for large document collections
to address this issue, one can build topical browsing interfaces based on phrase metadata, where the phrases have been extracted automatically from the full text of the documents themselves
representative key phrases can be chosen
43/45
Browsing using extracted metadata

the browsing methods rely on metadata that must be provided for all documents in the collection
this can be extracted automatically from the docs' full text
titles may be identified by seeking capitalized text close to the beginning of documents
names may be identified by looking for the capitalization and punctuation patterns that characterize forms such as "Surname, Forename" and "Forename Initial. Surname"
other data could also be identified based on data patterns
then that metadata can be indexed and used
there will be some residual errors, but the mechanism is still very useful
44/45
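Pattern-based extraction of this kind can be sketched with regular expressions. Everything specific here is an assumption for illustration: the title heuristic (first all-capitals line among the first five), the year pattern as an example of a "data pattern", and the sample text, in which "Jane Q. Reader" is a made-up name.

```python
import re

# "Surname, Forename", e.g. "Carroll, Lewis"
SURNAME_FIRST = re.compile(r"\b([A-Z][a-z]+), ([A-Z][a-z]+)\b")
# "Forename Initial. Surname", e.g. "Jane Q. Reader"
FORENAME_FIRST = re.compile(r"\b([A-Z][a-z]+) ([A-Z])\. ([A-Z][a-z]+)\b")
# Four-digit years from 1000 to 2099 (an assumed example of a data pattern)
YEAR = re.compile(r"\b(1\d{3}|20\d{2})\b")

def extract_metadata(text):
    """Guess title, names and years from the full text of a document."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    # Title heuristic: first fully capitalised line near the beginning.
    title = next((ln for ln in lines[:5] if ln.isupper()), None)
    names = [f"{fore} {sur}" for sur, fore in SURNAME_FIRST.findall(text)]
    names += [f"{fore} {ini}. {sur}" for fore, ini, sur in FORENAME_FIRST.findall(text)]
    years = YEAR.findall(text)
    return {"title": title, "names": names, "years": years}

sample = """ALICE'S ADVENTURES IN WONDERLAND
by Carroll, Lewis
first published 1865
scanned by Jane Q. Reader"""

meta = extract_metadata(sample)
```

Heuristics like these will misfire on some documents (for example, any capitalised word pair separated by a comma looks like a name), which is the source of the residual errors mentioned above.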
The DL definition reloaded..

a DL is a focused collection of digital objects, along with methods for access and retrieval, for selection and organization, and for maintenance of the collection
the definition accords equal weight to the user (access and retrieval) and to the librarian (organization and selection, and maintenance)
45/45
