An Introduction to Metadata Page 1 of 9


An Introduction to
Metadata

Paper written by Chris Taylor
Manager, Information Access Service
University of Queensland Library
c.taylor@library.uq.edu.au
Revised: 29 July 2003

1. What is Metadata?

Metadata is structured data which describes the characteristics of a resource. It shares many characteristics with the cataloguing that takes place in libraries, museums and archives. The term "meta" derives from the Greek, denoting something of a higher order or more fundamental kind. A metadata record consists of a number of pre-defined elements representing specific attributes of a resource, and each element can have one or more values. Below is an example of a simple metadata record:

Element name   Value
------------   -----
Title          Web catalogue
Creator        Dagnija McAuliffe
Publisher      University of Queensland Library
Identifier     http://www.library.uq.edu.au/iad/mainmenu.html
Format         Text/html
Relation       Library Web site
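A record like this can be modelled as a simple mapping from element names to lists of values, reflecting the rule that each element can have one or more values. The following Python sketch is purely illustrative (the `record` structure and the `add_value` helper are hypothetical, not part of any metadata standard):

```python
# A metadata record as a mapping from element names to lists of values.
# Element names and values are taken from the example record above;
# lists allow an element to be repeated.
record = {
    "Title": ["Web catalogue"],
    "Creator": ["Dagnija McAuliffe"],
    "Publisher": ["University of Queensland Library"],
    "Identifier": ["http://www.library.uq.edu.au/iad/mainmenu.html"],
    "Format": ["Text/html"],
    "Relation": ["Library Web site"],
}

def add_value(rec, element, value):
    """Append a value to an element, creating the element if needed."""
    rec.setdefault(element, []).append(value)

add_value(record, "Relation", "UQ home page")
print(record["Relation"])  # the Relation element now holds two values
```

The list-of-values representation matters because, as noted below, most schemas (including Dublin Core) allow elements to repeat.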

Each metadata schema will usually have the following characteristics:

• a limited number of elements
• the name of each element
• the meaning of each element

Typically, the semantics describe the contents, location, physical attributes, type (e.g. text or image, map or model) and form (e.g. print copy, electronic file) of a resource. Key metadata elements supporting access to published documents include the originator of a work, its title, when and where it was published and the subject areas it covers. Where the information is issued in analog form, such as print material, additional metadata is provided to assist in locating the information, e.g. call numbers used in libraries. The resource community may also define some logical grouping of the elements or leave it to the encoding scheme. For example, Dublin Core may provide the core to which extensions may be added.

Some of the most popular metadata schemas include:

• Dublin Core
• AACR2 (Anglo-American Cataloguing Rules)
• GILS (Government Information Locator Service)
• EAD (Encoded Archival Description)
• IMS (IMS Global Learning Consortium)

http://www.library.uq.edu.au/iad/ctmeta4.html 12/23/2009
• AGLS (Australian Government Locator Service)

While the syntax is not strictly part of the metadata schema, the data will be unusable unless the encoding scheme preserves the semantics of the metadata schema. The encoding allows the metadata to be processed by a computer program. Important encoding schemes include:

• HTML (HyperText Markup Language)
• SGML (Standard Generalised Markup Language)
• XML (eXtensible Markup Language)
• RDF (Resource Description Framework)
• MARC (MAchine Readable Cataloging)
• MIME (Multipurpose Internet Mail Extensions)

Metadata may be deployed in a number of ways:

• Embedding the metadata in the Web page, by the creator or their agent, using META tags in the HTML coding of the page
• As a separate HTML document linked to the resource it describes
• In a database linked to the resource. The records may either have been created directly within the database or extracted from another source, such as Web pages.

The simplest method is for Web page creators to add the metadata as part of creating the page. Creating metadata directly in a database and linking it to the resource is growing in popularity as an activity independent of the creation of the resources themselves. Increasingly, metadata is being created by an agent or third party, particularly to develop subject-based gateways.

2. What is a search engine?

In a nutshell, search engines, such as Google and HotBot, consist of a software package that crawls the Web, extracting and organising the data in a database. People can then submit a search query using a Web browser. The search engine locates the appropriate data in the database and displays it via the browser. This is not to be confused with directories such as Yahoo, which provide subject lists created by humans that must be browsed. Of course, the Web being the Web, things change very rapidly. For example, in October 2002, Yahoo made a giant shift to using Google's crawler-based listings for its main results. Nonetheless, search engines have three major elements:

• The spider, also called the crawler, harvester, robot or gatherer. The spider visits a Web page, reads it, and then follows links to other pages within the site. The spider returns to the site on a regular basis, such as every month or two, to look for changes.
• The index. Everything the spider finds goes into the index. The index is like a giant book containing a copy of every Web page that the spider finds. If a Web page changes, then this book is updated with new information.
• Search engine software. This is the program that sifts through the millions of pages recorded in the index to find matches to a search and rank them in order of what it believes is most relevant.

Search engine software is also available to run on a local Web site. The software has the same basic components,
but the spider just visits the local site or a limited number of sites in a community.
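The three elements can be illustrated with a toy, self-contained sketch. The `PAGES` dict below is a hypothetical stand-in for the Web (no real network access), and `spider`, `build_index` and `search` correspond to the spider, the index and the search engine software described above:

```python
import re

# A tiny "Web": URL -> (page text, links to other URLs). Hypothetical data.
PAGES = {
    "a": ("metadata describes resources", ["b"]),
    "b": ("search engines index web pages", ["a", "c"]),
    "c": ("dublin core metadata elements", []),
}

def spider(start):
    """The 'spider': visit a page, then follow its links to other pages."""
    seen, queue = set(), [start]
    while queue:
        url = queue.pop()
        if url in seen:
            continue
        seen.add(url)
        queue.extend(PAGES[url][1])
    return seen

def build_index(urls):
    """The 'index': a map from each word to the set of pages containing it."""
    index = {}
    for url in urls:
        for word in re.findall(r"\w+", PAGES[url][0]):
            index.setdefault(word, set()).add(url)
    return index

def search(index, query):
    """The 'search engine software': pages containing every query word."""
    sets = [index.get(w, set()) for w in query.lower().split()]
    return set.intersection(*sets) if sets else set()

index = build_index(spider("a"))
print(sorted(search(index, "metadata")))  # → ['a', 'c']
```

A real engine adds ranking, revisit scheduling and scale, but the division of labour is the same.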

3. Why isn't an Internet search engine good enough?

The problem relates to the underlying nature of the World Wide Web. In the early 1990s, "surfing" the World Wide
Web was popularised in the mass media. These days, the concept of browsing the Web is little used. The Web has
become a two-edged sword. It is now very easy to publish information, but it is becoming more difficult to find
relevant information [EC, p.4]. For outsiders and casual users, much of the useful material is difficult to locate
and therefore is effectively unavailable [DC1, p.2].

At the global level, Internet search engines were developed to search across multiple Web sites. Unfortunately, these search engines have not been the panacea that some people had hoped for. Every search engine will give you good results some of the time and bad results some of the time. This is what information scientists term "high recall" and "low precision". High recall refers to the well-known (and frustrating) experience of using an Internet search engine and receiving thousands of "hits"; it is popularly known as information overload. Low precision refers to not being able to locate the most useful documents. The search engine companies do not view the high hit rates as a problem. Indeed, they market their products on the basis of their coverage of the Web, not on the precision of the search results.
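The trade-off can be made concrete with a small sketch. The numbers below are hypothetical: a search that returns 1000 hits but captures only 10 of the 12 truly relevant documents has high recall and very low precision:

```python
def precision_recall(retrieved, relevant):
    """Precision: fraction of retrieved documents that are relevant.
    Recall: fraction of relevant documents that were retrieved."""
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical search: 1000 hits returned, capturing 10 of the 12
# truly relevant documents -- high recall, low precision.
retrieved = {f"doc{i}" for i in range(1000)}
relevant = {f"doc{i}" for i in range(10)} | {"other1", "other2"}

p, r = precision_recall(retrieved, relevant)
print(f"precision={p:.3f} recall={r:.3f}")  # → precision=0.010 recall=0.833
```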

The Working Group on Government Information Navigation outlined the problems with Internet search engines:

• relevant information can be missed because sites contain types of resource in addition to HTML text (e.g. images, databases, PDF documents);
• the search engines frequently do not harvest every page on a site, but often only the top two or three hierarchical levels, thus missing significant documents which, on larger and more complex sites, may be located in lower levels of the hierarchy;
• search engines, especially the more comprehensive ones, may index sites on an infrequent basis and may therefore not contain the most current data; and
• irrelevant information can be retrieved because the search engine has no means (or very few means) of distinguishing between important and incidental words in the document text. [WGGIN, p.2]

The introduction of the <META> element as part of HTML coding was, in part, an attempt to encourage search engines to extract and index more structured data, such as description and keywords. However, search engines are rather proprietorial in recognising <META> tags: support ranges from none at all to reasonable. Details are available from Search Engine Watch [SEW]. As far as I am aware, none currently supports metadata schemas. It is the proverbial "chicken and egg" situation. Web page authors and publishers do not invest in providing metadata if the indexing services do not utilise it, and harvesters do not collect metadata if there is not enough data available. The other problem is the malicious "spoofing" of search engines, making them return pages that are irrelevant to the search at hand or pages that rank higher than their content warrants.

Support for <META> tags by search engines designed for local Web servers varies from non-existent to good.
Some of the specialist packages include support for Dublin Core or other metadata schemas.
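Extracting <META> tags is straightforward for software that does support them. The sketch below uses Python's standard `html.parser` module on a fragment modelled on the Dublin Core example later in this paper; the `MetaExtractor` class is illustrative, not a real harvester:

```python
from html.parser import HTMLParser

class MetaExtractor(HTMLParser):
    """Collect NAME/CONTENT pairs from <META> tags, as a harvester might."""
    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names for us.
        if tag == "meta":
            d = dict(attrs)
            if "name" in d and "content" in d:
                self.meta.setdefault(d["name"], []).append(d["content"])

page = """<HTML><HEAD>
<META NAME="DC.Title" CONTENT="Song of the Open Road">
<META NAME="DC.Creator" CONTENT="Nash, Ogden">
</HEAD><BODY>...</BODY></HTML>"""

parser = MetaExtractor()
parser.feed(page)
print(parser.meta["DC.Title"])  # → ['Song of the Open Road']
```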

4. Why use metadata?

The foregoing section has discussed the inadequacy of search engines in locating quality information resources.
How does metadata solve the problem? A more formal definition of metadata offers a clue:

Metadata is data associated with objects which relieves their potential users of having full advance
knowledge of their existence or characteristics. [DESIRE, p.2]

Information resources must be made visible in a way that allows people to tell whether the resources are likely to
be useful to them. This is no less important in the online world, and in particular, the World Wide Web. Metadata is
a systematic method for describing resources and thereby improving access to them. If a resource is worth
making available, then it is worth describing it with metadata, so as to maximise the ability to locate it.

Metadata provides the essential link between the information creator and the information user.

While the primary aim of metadata is to improve resource discovery, metadata sets are also being developed for other reasons, including:

• administrative control
• security
• personal information
• management information
• content rating
• rights management
• preservation

While this document concentrates on resource discovery and retrieval, these additional purposes for metadata
should also be kept in mind.

5. Which Metadata schema?

There are literally hundreds of metadata schemas to choose from and the number is growing rapidly, as different
communities seek to meet the specific needs of their members.

Recognising the need to define a simple metadata record that could sufficiently describe a wide range of electronic documents, the Online Computer Library Center (OCLC), of which the University of Queensland Library is currently the only full member in Australia, combined with the National Center for Supercomputing Applications (NCSA) to sponsor the first Metadata Workshop in March 1995 in Dublin, Ohio [DC1]. The primary outcome of the workshop was a set of 13 elements (subsequently increased to 15) named the Dublin Metadata Core Element Set (known as Dublin Core). Dublin Core was proposed as the minimum number of metadata elements required to facilitate the discovery of document-like objects in a networked environment such as the Internet.

Below is a summary of the elements in Dublin Core. The metadata elements fall into three groups which roughly
indicate the class or scope of information stored in them: (1) elements related mainly to the content of the
resource, (2) elements related mainly to the resource when viewed as intellectual property, and (3) elements
related mainly to the physical manifestation of the resource.

Content & about      Intellectual         Electronic or physical
the resource         property             manifestation
---------------      -----------------    ----------------------
Title                Author or Creator    Date
Subject              Publisher            Type
Description          Contributor          Format
Source               Rights               Identifier
Language
Relation
Coverage

A description of each element is given in Appendix 1. Below is an example of a Dublin Core record for a short
poem, encoded as part of a Web page using the <META> tag:

<HTML>
<HEAD>
<TITLE>Song of the Open Road</TITLE>
<META NAME="DC.Title" CONTENT="Song of the Open Road">
<META NAME="DC.Creator" CONTENT="Nash, Ogden">
<META NAME="DC.Type" CONTENT="text">
<META NAME="DC.Date" CONTENT="1939">
<META NAME="DC.Format" CONTENT="text/html">
<META NAME="DC.Identifier" CONTENT="http://www.poetry.com/nash/open.html">
</HEAD>
<BODY><PRE>
I think that I shall never see
A billboard lovely as a tree.
Indeed, unless the billboards fall
I'll never see a tree at all.
</PRE></BODY>
</HTML>

The <META> tag is not normally displayed by Web browsers, but can be viewed by selecting "Page Source".

In addition to the 15 elements, three qualifying aspects have been accepted to enable the Dublin Core to function in an international context and also meet higher-level scientific and subject-specific resource discovery needs. These three Dublin Core Qualifiers are:

• LANG: indicating the language of the contents of the element, to be used both in resource discovery and in filtering retrieval results

• SCHEME: indicating the set of regulations, standards, conventions or norms from which a term in the content of the element has been taken

• SUB-ELEMENT: refinement of some of the elements to gain more precision
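The qualifiers can be illustrated by generating qualified <META> tags. The exact attribute syntax varied between early Dublin Core encoding conventions, so treat this Python sketch (and the `dc_meta` helper) as illustrative only; LCSH is assumed here as an example subject scheme:

```python
def dc_meta(element, content, scheme=None, lang=None):
    """Render one Dublin Core element as an HTML <META> tag, with the
    optional SCHEME and LANG qualifiers described above."""
    attrs = [f'NAME="DC.{element}"']
    if scheme:
        attrs.append(f'SCHEME="{scheme}"')
    if lang:
        attrs.append(f'LANG="{lang}"')
    attrs.append(f'CONTENT="{content}"')
    return "<META " + " ".join(attrs) + ">"

# Hypothetical examples: a subject term taken from LCSH; a title in English.
print(dc_meta("Subject", "Metadata", scheme="LCSH"))
print(dc_meta("Title", "An Introduction to Metadata", lang="en"))
```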

6. Why Dublin Core?

http://www.library.uq.edu.au/iad/ctmeta4.html 12/23/2009
An Introduction to Metadata Page 5 of 9

The Dublin Core metadata schema offers the following advantages:

• Its usability and its flexibility
• The semantics of the elements are designed to be clear enough to be understood by a wide range of customers, without the need for training
• The elements of Dublin Core are easily identifiable from the work in hand, such as its intellectual content and physical format
• It is not intended to supplant other resource descriptions, but rather to complement them. It is intended to describe the essential features of electronic documents that support resource discovery. Other important metadata, such as accounting and archival data, were deliberately excluded to keep the schema as simple and useable as possible.
• It is mostly syntax independent, to support its use in the widest range of applications
• All elements are optional, but each site may define which elements are mandatory and which are optional
• All elements are repeatable
• The elements may be modified in limited and well-defined ways through the use of specific qualifiers, such as the name of the thesaurus used in the Subject element
• It can be extended to meet the demands of more specialised communities. From the very beginning, the Dublin Core creators recognised that some resources could not be adequately described by a small set of elements. They came up with two solutions: firstly, allowing the addition of elements for site-specific purposes or specialised fields; secondly, designing the Dublin Core schema so that it could be mapped into more complex and tightly controlled systems, such as MARC.
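Such a mapping is essentially a crosswalk table. The sketch below is illustrative only: the MARC field numbers follow the general pattern of published DC-to-MARC crosswalks but are simplified (real crosswalks also specify indicators and subfield codes):

```python
# Illustrative Dublin Core -> MARC 21 field crosswalk (simplified).
DC_TO_MARC = {
    "Title": "245",
    "Creator": "100",
    "Subject": "650",
    "Description": "520",
    "Publisher": "260",
    "Identifier": "856",
}

def to_marc(dc_record):
    """Map a flat DC record into (MARC field, value) pairs, dropping
    any elements the crosswalk does not cover."""
    return [(DC_TO_MARC[el], val)
            for el, val in dc_record.items() if el in DC_TO_MARC]

print(to_marc({"Title": "Song of the Open Road", "Creator": "Nash, Ogden"}))
# → [('245', 'Song of the Open Road'), ('100', 'Nash, Ogden')]
```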

Dublin Core has received widespread acceptance amongst the resource discovery community and has become the de facto Internet metadata standard [AGLS, p.3].

To date, the depth of implementation in individual sectors has been patchy. In Australia, much activity has taken
place in the government sector, under the auspices of the Government Technology and Telecommunications
Committee (GTTC). Dublin Core has been formally accepted as the standard for the Australian Government
Locator Service [AGLS].

7. Which elements, sub-elements and schemes should I use?

There is no simple answer to this question. At a fundamental level, it becomes a compromise based on:

• the specific needs of the local community to maximise information retrieval and management
• the need to guard against making the creation of metadata and its maintenance more trouble than it is worth, thereby defeating its purpose
• the sustainability of the metadata schema, in terms of keeping the records up to date

The bottom line is that a simple description is better than no description at all, as long as it can aid in the consistent discovery of resources.

The level of specificity in resource description is also important. The resources can be described individually or at a
collection or aggregate level. It would be practically impossible to provide guidelines as to the appropriate level of
specificity. Cataloguing librarians have been arguing the toss for years without reaching a consensus. As always,
we should think in terms of customer needs. As noted above, with the major search engines, it is possible to have
too many records, such that our customers can't see the forest for the trees. Initially, it would be sensible to allow
the creators to determine which resources deserve their own record. If a collection-level record is used, it is
important to add as much information as possible to ensure appropriate retrieval.

Acting on customer feedback is also important. Monitoring the search terms input by customers is a well-proven technique for improving the quality and coverage of a database. The downside is that the assessment process is essentially a manual one.

8. What about using controlled terminology?

Consistent use of language within metadata descriptions can aid in the consistent discovery of resources. The primary tool for ensuring consistent language usage is a controlled vocabulary, including the use of thesauri. A number of metadata elements would benefit from controlled values.

There are many subject thesauri available. However, most are designed for specialist resource communities. For
example, the Edinburgh Engineering Virtual Library (EEVL) originally selected the Engineering Information
thesaurus, but decided that it was too complex for the purpose. Instead they developed a modified version to suit their specific needs.

Ultimately, as the AGLS Metadata Element Set notes, "… a common sense, author-based approach is still effective
and yields a high return to agencies." [AGLS1].

In the absence of a suitable subject thesaurus, some may be tempted to create one from scratch. This temptation is to be resisted at all costs. History is studded with failed attempts at developing new thesauri. It's like establishing a small business: people don't seem to understand that starting is easy; finding the resources to keep the thesaurus current is the real trick. Keeping a thesaurus up to date is a huge investment in resources that is very difficult to justify.

While strictly not a metadata issue, the mismatch between input and index terms has proven to be a major problem in retrieval from databases, particularly as a result of semantic problems, such as different spellings, singular and plural forms, etc. Although the basic query interfaces for search engines seem similar, there are important differences that affect the outcome of the search. For example, the query 'Mabo Legislation' could be interpreted by different engines as requesting resources that contain:

• the words 'Mabo' and 'legislation';
• either of the words 'Mabo' or 'legislation';
• the expression 'Mabo legislation' as a single unit.

Obviously, these three different interpretations will produce different sets of results. Search engines differ in
whether queries are case sensitive and how they handle singular versus plural forms of a word. Alternative
spellings, for example, labour and labor, may have to be searched separately. The same applies to abbreviations,
such as dept and department. This disconcerts the naive user and annoys the experienced user. One solution is to
use a common query interface, or an intermediate query engine which takes a standard query and translates it
into the specific forms required by the site search engine.
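The three interpretations can be demonstrated directly. The documents below are hypothetical, and `interpret` is an illustrative helper, not how any particular engine works:

```python
def interpret(query, docs, mode):
    """Return the documents matched under one interpretation of a
    multi-word query: all words, any word, or the exact phrase."""
    words = query.lower().split()
    hits = set()
    for name, text in docs.items():
        t = text.lower()
        if mode == "all" and all(w in t.split() for w in words):
            hits.add(name)
        elif mode == "any" and any(w in t.split() for w in words):
            hits.add(name)
        elif mode == "phrase" and query.lower() in t:
            hits.add(name)
    return hits

# Hypothetical documents illustrating the 'Mabo Legislation' example.
docs = {
    "d1": "the Mabo decision led to native title legislation",
    "d2": "Mabo was a Torres Strait Islander",
    "d3": "new legislation was passed",
}

for mode in ("all", "any", "phrase"):
    print(mode, sorted(interpret("Mabo Legislation", docs, mode)))
```

The same query matches one, three, or zero documents depending on the interpretation, which is exactly why results differ between engines.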

9. Where will the metadata be stored?

Metadata may be deployed in a number of ways:

• Embedding the metadata in the Web page, by the creator or their agent, using META tags in the HTML coding of the page
• As a separate HTML document linked to the resource it describes
• In a database linked to the resource. The records may either have been created directly within the database or extracted from another source, such as Web pages.

The simplest method is to ask Web page creators to add the metadata as part of creating the page. To support
rapid retrieval, the metadata should be harvested on a regular basis by the site robot. This is currently by far the
most popular method for deploying Dublin Core. An increasing range of software is being made available to assist
in the addition of metadata to Web pages.

Creating metadata directly in a database and linking it to the resource is growing in popularity as an activity independent of the creation of the resources themselves. Increasingly, metadata is being created by an agent or third party, particularly to develop subject-based gateways. The University of Queensland Library is involved in a number of gateway projects, including AVEL and Weblaw.

10. Syntax Issues

For metadata attached to Web pages, the standard encoding scheme is HTML (HyperText Markup Language). RDF (Resource Description Framework) supports multiple metadata schemes. It uses XML (eXtensible Markup Language) to express the structure. The advantages of using RDF/XML are many:

• it separates data management from data presentation, making both processes more efficient
• it can handle multiple metadata schemas in the one record
• it is easier for computers to understand
• it can group elements
• it supports complex values
• it supports multiple languages

Its major drawback is that user-friendly tools to generate XML are still scarce. For metadata contained within a database, the encoding scheme is a lesser issue. What is important is its interoperability with other database schemas, to support cross-database searching and the sharing of metadata records.
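A minimal RDF/XML serialisation of a Dublin Core record can be sketched with Python's standard `xml.etree.ElementTree` module. The namespace URIs are the usual RDF and DC 1.1 namespaces; the `dc_to_rdfxml` helper is illustrative and omits many RDF features:

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("rdf", RDF)
ET.register_namespace("dc", DC)

def dc_to_rdfxml(about, record):
    """Serialise a flat Dublin Core record as minimal RDF/XML."""
    rdf = ET.Element(f"{{{RDF}}}RDF")
    desc = ET.SubElement(rdf, f"{{{RDF}}}Description",
                         {f"{{{RDF}}}about": about})
    for element, value in record.items():
        # DC element names are conventionally lowercase in RDF/XML.
        ET.SubElement(desc, f"{{{DC}}}{element.lower()}").text = value
    return ET.tostring(rdf, encoding="unicode")

xml = dc_to_rdfxml("http://www.poetry.com/nash/open.html",
                   {"Title": "Song of the Open Road",
                    "Creator": "Nash, Ogden"})
print(xml)
```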

In the context of Web indexing, there are currently two Webs in existence. The first is the "visible" Web, made up
of static Web pages that can be harvested and indexed. The second is the "invisible" Web, made up of dynamic
pages generated from a database. These pages can’t be directly harvested by a robot and indexed. The records have to be exported from the database, which is not always a trivial matter. Even if they could be harvested, the amount of data in a single, centralised database would be unmanageable.

One option is to interrogate multiple databases at the same time. There are proprietorial systems that can do this, usually at great expense. Individual systems can also talk to one another if they conform to the US National Information Standards Organization (NISO) Z39.50 protocol [NISO]. The Z39.50 protocol for distributed information retrieval supports the searching of disparate databases, either singly or in combination, regardless of proprietorial interfaces. Z39.50 supports a number of "profiles" to enable translation between various databases. Unfortunately, few databases and local search engines support Z39.50.

A more recent development in federated searching is the increasing availability of portal-type software [LC] that supports a single search across multiple databases. The actual techniques remain highly parochial, but in essence the software relies on a client to simultaneously interrogate the indexes of a number of databases, with the results being normalised for display using a locally defined metadata schema (usually DC). More sophisticated versions use some type of record de-duplication technique.
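The normalise-and-de-duplicate approach can be sketched as follows. The two databases, their record formats and the de-duplication key are all hypothetical; real portal software works against live search interfaces rather than in-memory lists:

```python
# Two hypothetical databases with different native record formats.
db_a = [{"ttl": "Metadata Basics", "auth": "Smith, A."}]
db_b = [{"title": "Metadata Basics", "creator": "Smith, A."},
        {"title": "Z39.50 Explained", "creator": "Jones, B."}]

def from_a(rec):
    """Normalise a db_a record to a common Dublin Core-like schema."""
    return {"Title": rec["ttl"], "Creator": rec["auth"]}

def from_b(rec):
    """Normalise a db_b record to the same common schema."""
    return {"Title": rec["title"], "Creator": rec["creator"]}

def federated_search(term):
    """Interrogate both sources, normalise, then de-duplicate results."""
    results = [from_a(r) for r in db_a] + [from_b(r) for r in db_b]
    matches = [r for r in results if term.lower() in r["Title"].lower()]
    seen, unique = set(), []
    for r in matches:
        key = (r["Title"], r["Creator"])   # crude de-duplication key
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique

print(federated_search("metadata"))
# → [{'Title': 'Metadata Basics', 'Creator': 'Smith, A.'}]
```

Both databases hold the same item; normalisation lets the de-duplication step recognise that and return it once.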

Such software is achieving relatively quick penetration in the library marketplace. This is partially due to the fact that the software has been largely developed by library system vendors seeking to broaden their marketplace. It has also come about as a result of librarians decrying the "search interface wars", i.e. there are just too many database search interfaces for librarians and their clients to learn. Such solutions do not come cheap, however.

The other recent development is the Open Archives Initiative [OAI], which seeks to harvest standards-based metadata (DC is the minimum standard) to build metadata repositories.

11. How does one create metadata?

The more easily the metadata can be created and collected at the point of creation of a resource or at the point of publication, the more efficient the process and the more likely it is to take place. There are many such tools available and the number continues to grow. Such tools can be standalone or part of a package of software, usually with a backend database or repository to store and retrieve the metadata records. Some examples include:

• DC-dot - http://www.ukoln.ac.uk/metadata/dcdot/. This service will retrieve a Web page and automatically generate Dublin Core metadata, either as HTML <META> tags or as RDF/XML, suitable for embedding in the <head> section of the page.
• DCmeta - http://www.dstc.edu.au/RDU/MetaWeb/generic_tool.html. Developed by Tasmania Online. It is based on the SuperNoteTab text editor and can be customised.
• HotMeta - http://www.dstc.edu.au/Research/Projects/hotmeta/. A package of software, including a metadata editor, repository and search engine.

Ideally, metadata should be created using a purpose-built tool, with the manual creation of data kept to an absolute minimum. The tool should support:

• Inclusion of the syntax in the template (e.g. element name, sub-element, qualifier)
• Default content, which can be overridden
• Content selected from a list of limited choices (e.g. Function, Type, Format)
• Validation of mandatory elements, sub-elements, schemes and element values
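These requirements can be sketched as a small validation routine. The mandatory elements, default content and controlled list below are assumptions a site would define for itself, not part of Dublin Core:

```python
# Site-defined policy (hypothetical): which elements are mandatory,
# what default content applies, and which values are controlled.
MANDATORY = {"Title", "Creator", "Date"}
DEFAULTS = {"Publisher": "University of Queensland Library"}
CONTROLLED = {"Type": {"text", "image", "sound", "dataset"}}

def validate(record):
    """Apply defaults, then return the record and a list of problems."""
    rec = {**DEFAULTS, **record}       # defaults can be overridden
    problems = []
    for element in MANDATORY - rec.keys():
        problems.append(f"missing mandatory element: {element}")
    for element, allowed in CONTROLLED.items():
        if element in rec and rec[element] not in allowed:
            problems.append(f"{element} not in controlled list: {rec[element]}")
    return rec, problems

rec, problems = validate({"Title": "Web catalogue", "Type": "webpage"})
print(problems)  # missing Creator and Date; 'webpage' not a controlled Type
```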

References

[AGLS] Australian Government Locator Service Implementation Plan: A Report by the Australian Government Locator Service Working Party (AGLS WG), December 1997.

[AGLS1] AGLS Metadata Element Set. National Archives of Australia. http://www.naa.gov.au/recordkeeping/gov_online/agls/metadata_element_set.html

[DC1] The Essential Elements of Networked Object Description. Stuart Weibel. OCLC/NCSA Metadata Workshop, March 1995. http://www.oclc.org:5046/oclc/research/metadata/dublin_core_report.html

[DESIRE] Specification for resource description methods Part 1: A review of metadata: a survey of current resource description formats. Lorcan Dempsey and Rachel Heery, March 1997. http://www.ukoln.ac.uk/metadata/desire/overview/

[EC] Metadata Workshop. European Commission, Telematics for Libraries, December 1997. http://hosted.ukoln.ac.uk/ec/metadata-1997/

[LC] The Library of Congress Portals Applications Issues Group. http://www.loc.gov/catdir/lcpaig/

[NISO] Information Retrieval (Z39.50) - Application Service: Definition and Protocol Specification (Version 3). The National Information Standards Organization, 1995. http://www.niso.org/

[OAI] Open Archives Initiative. http://www.openarchives.org/

[SEW] Search Engine Watch. http://searchenginewatch.com/webmasters/features.html

[WGGIN] Improving Access to Information and Services of Australian Governments. Working Group on Government Information Navigation, July 1997. http://www.nla.gov.au/lis/esd4.html

Appendix 1: Dublin Core Metadata schema

Element      Element description
-------      -------------------
Creator      Person or organisation primarily responsible for creating the
             intellectual content of the resource, e.g. authors in the case
             of written documents; artists, photographers, etc. in the case
             of visual resources.

Publisher    The entity (e.g. agency, including unit/branch/section)
             responsible for making the resource available in its present
             form, such as a publishing house, a university department, or
             a corporate entity.

Contributor  Person or organisation not specified in a Creator element who
             has made significant intellectual contributions to the
             resource, but whose contribution is secondary to any person or
             organisation specified in a Creator element, e.g. editor,
             transcriber, illustrator.

Rights       A rights management statement, or an identifier that links to
Management   a rights management statement.

Title        The name given to the resource, usually by the creator or
             publisher. Can be the same as the title of the resource, or
             may be more descriptive.

Subject      The topic of the resource. Typically, this will be expressed
             as keywords or phrases that describe the subject or content of
             the resource. Controlled vocabularies and formal
             classification schemes are encouraged.

Date         A date associated with the creation or availability of the
             resource.

Identifier   A string or number used to uniquely identify the resource.
             Examples for networked resources include URLs, PURLs and URNs.
             ISBNs or other formal names can also be used.

Description  A textual description of the content of the resource,
             including abstracts in the case of document-like objects or
             content descriptions in the case of visual resources.

Source       The work, either print or electronic, from which this object
             is derived, if applicable. Source is not applicable if the
             present resource is in its original form.

Language     The language of the intellectual content of the resource.

Relation     Relationship to other resources, e.g. images in a document,
             chapters in a book, items in a collection.

Coverage     Spatial locations and temporal duration characteristic of the
             resource.

Type         The category of the resource, such as home page, novel, poem,
             working paper, technical report, essay, dictionary.

Format       The data format of the resource, used to identify the software
             and possibly hardware that might be needed to display or
             operate the resource, e.g. PostScript, HTML, text, JPEG, XML.
