Creating and Documenting Electronic Texts: A Guide to Good Practice


by Alan Morrison, Michael Popham, Karen Wikander
Chapter 1: Introduction
1.1: Aims and organisation of this Guide
1.2: What this Guide does not cover, and why
1.3: Opening questions - Who will read your text, why, and how?
Chapter 2: Document Analysis
2.1: What is document analysis?
2.2: How should I start?
2.2.1: Project objectives
2.2.2: Document context
2.3: Visual and structural analysis
2.4: Typical textual features
Chapter 3: Digitization - Scanning, OCR, and Re-keying
3.1: What is digitization?
3.2: The digitization chain
3.3: Scanning and image capture
3.3.1: Hardware - Types of scanner and digital cameras
3.3.2: Software
3.4: Image capture and Optical Character Recognition (OCR)
3.4.1: Imaging issues
3.4.2: OCR issues
3.5: Re-keying
Chapter 4: Markup: The key to reusability
4.1: What is markup?
4.2: Visual/presentational markup vs. structural/descriptive markup
4.2.1: PostScript and Portable Document Format (PDF)
4.2.2: HTML 4.0
4.2.3: User-definable descriptive markup
4.3: Implications for long-term preservation and reuse
Chapter 5: SGML/XML and TEI
5.1: The Standard Generalized Markup Language (SGML)
5.1.1: SGML as metalanguage
5.1.2: The SGML Document
5.1.3: Creating Valid SGML Documents
5.1.4: XML: The Future for SGML
5.2: The Text Encoding Initiative and TEI Guidelines
5.2.1: A brief history of the TEI
5.2.2: The TEI Guidelines and TEI Lite
5.3: Where to find out more about SGML/XML and the TEI
Chapter 6: Documentation and Metadata
6.1: What is Metadata and why is it important?
6.1.1: Conclusion and current developments
6.2: The TEI Header
6.2.1: The TEI Lite Header Tag Set
6.2.2: The TEI Header: Conclusion
6.3: The Dublin Core Element Set and the Arts and Humanities Data Service
6.3.1: Implementing the Dublin Core
6.3.2: Conclusions and further reading
6.3.3: The Dublin Core Elements
Chapter 7: Summary
Step 1: Sort out the rights
Step 2: Assess your material
Step 3: Clarify your objectives
Step 4: Identify the resources available to you and any relevant standards
Step 5: Develop a project plan
Step 6: Do the work!
Step 7: Check the results
Step 8: Test your text
Step 9: Prepare for preservation, maintenance, and updating
Step 10: Review and share what you have learned
Bibliography
Glossary
Chapter 1: Introduction
1.1: Aims and organisation of this Guide
The aim of this Guide is to take users through the basic steps involved in creating and
documenting an electronic text or similar digital resource. The notion of 'electronic text' is interpreted very
broadly, and discussion is not limited to any particular discipline, genre, language or period, although where
space permits, issues that are especially relevant to these areas may be drawn to the reader's attention.
The authors have tended to concentrate on those types of electronic text which, to a greater or
lesser extent, represent a transcription (or, if you prefer, a 'rendition', or 'encoding') of a non-electronic
source, rather than the category of electronic texts which are primarily composed of digitized images of a
source text (e.g. digital facsimile editions). However, there are a growing number of electronic textual
resources which support both these approaches; for example some projects involving the digitization of rare
illuminated manuscripts combine high-quality digital images (for those scholars interested in the appearance of
the source) with electronic text transcriptions (for those scholars concerned with analysing aspects of the
content of the source). We would hope that the creators of every type of electronic textual resource will find
something of interest in this short work, especially if they are newcomers to this area of intellectual and
academic endeavour.
This Guide assumes that the creators of electronic texts have a number of common concerns. For
example, that they wish their efforts to remain viable and usable in the long-term, and not to be unduly
constrained by the limitations of current hardware and software. Similarly, that they wish others to be able to
reuse their work, for the purposes of secondary analysis, extension, or adaptation. They also want the tools,
techniques, and standards that they adopt to enable them to capture those aspects of any non-electronic
sources which they consider to be significant, whilst at the same time being practical and cost-effective to
implement.
The Guide is organised in a broadly linear fashion, following the sequence of actions and decisions
which we would expect any electronic text creation project to undertake. Not every electronic text creator
will need to consider every stage, but it may be useful to read the Guide through once, if only to establish the
most appropriate course of action for one's own work.
1.2: What this Guide does not cover, and why
Creating and processing electronic texts was one of the earliest areas of computational activity,
and has been going on for at least half a century. This Guide does not have any pretence to be a comprehensive
introduction to this complex area of digital resource creation, but the authors have attempted to highlight
some of the fundamental issues which will need to be addressed, particularly by anyone working within the
community of arts and humanities researchers, teachers, and learners, who may never before have undertaken
this kind of work.
Crucially, this Guide will not attempt to offer a comprehensive (or even a comparative) overview
of the available hardware and software technologies which might form the basis of any electronic text
creation project. This is largely because the development of new hardware and software continues at such a
rapid pace that anything we might review or recommend here will probably have been superseded by the time
this publication becomes available in printed form. Similarly, there would have been little point in providing
detailed descriptions of how to combine particular encoding or markup schemes, metadata, and delivery
systems, as the needs and abilities of the creators and (anticipated) users of an electronic text should be the
major factors influencing its design, construction, and method of delivery.
Instead, the authors have attempted to identify and discuss the underlying issues and key
concerns, thereby helping readers to begin to develop their own knowledge and understanding of the whole
subject of electronic text creation and publication. When combined with an intimate knowledge of the non-
electronic source material, readers should be able to decide for themselves which approach and thus which
combinations of hardware and software, techniques and design philosophy will be most appropriate to their
needs and the needs of any other prospective users.
Although every functional aspect of computers is based upon the distinctive binary divide
evidenced between 1's and 0's, true and false, presence and absence, it is rarely so easy to draw such clear
distinctions at the higher levels of creating and documenting electronic texts. Therefore, whilst reading this
Guide it is important to remember that there are seldom 'right' or 'wrong' ways to prepare an electronic text,
although certain decisions will crucially affect the usefulness and likely long-term viability of the final
resource. Readers should not assume that any course of action recommended here will necessarily be the
'best' approach in any or all given circumstances; however, everything the authors say is based upon our
understanding of what constitutes good practice and results from almost twenty-five years of experience
running the Oxford Text Archive (http://ota.ahds.ac.uk).
1.3: Opening questions - Who will read your text, why, and how?
There are some fundamental questions that will recur throughout this Guide, and all of them
focus upon the intended readership (or users) of the electronic text that you are hoping to produce. For
example, if your main reason for creating an electronic text is to provide the raw data for computer-assisted
analysis (perhaps as part of an authorship attribution study), then completeness and accuracy of the data
will probably be far more important than capturing the visual appearance of the source text. Conversely, if you
are hoping to produce an electronic text that will have broad functionality and appeal, and the original source
contains presentational features which might be considered worthy of note, then you should be attempting to
create a very different object, perhaps one where visual fidelity is more important than the absolute
accuracy of any transcription. In the former case, the implicit assumption is that no-one is likely to read the
electronic text (data) from start to finish, whilst in the second case it is more likely that some readers may
wish to use the electronic text as a digital surrogate for the original work. As the nature of the source(s)
and/or the intended resource(s) becomes more complex (for example, recording variant readings of a manuscript or discrepancies between different editions of the same printed text), the same fundamental
questions remain.
The next chapter of this Guide looks at how you might start to address some of these questions,
by subjecting your source(s) to a process that the creators of electronic texts have come to call 'Document
Analysis'.
Chapter 2: Document Analysis
2.1: What is document analysis?
Deciding to create an electronic text is just like deciding to begin any other type of construction
project. While the desire to dive right in and begin building is tempting, any worthwhile endeavour will begin
with a thorough planning stage. In the case of digitized text creation, this stage is called document analysis.
Document analysis is literally the task of examining the physical object in order to acquire an understanding
of the work being digitized and to decide what the purpose and future of the project entail. The
digitization of texts is not simply making groups of words available to an online community; it involves the
creation of an entirely new object. This is why achieving a sense of what it is that you are creating is critical.
The blueprint for construction will allow you to define the foundation of the project. It will also allow you to
recognise any problems or issues that have the potential to derail the project at a later point.
Document analysis is all about definition: defining the document context, defining the document
type and defining the different document features and relationships. At no other point in the project will you
have the opportunity to spend as much quality time with your document. This is when you need to become
intimately acquainted with the format, structure, and content of the texts. Document analysis is not limited to
physical texts, but as the goal of this guide is to advise on the creation of digital texts from the physical
object, this will be the focus of the chapter. For discussions of document analysis on objects other than text,
please refer to such studies as Yale University Library Project Open Book
(http://www.library.yale.edu/preservation/pobweb.htm), the Library of Congress American Memory Project
and National Digital Library Program (http://lcweb2.loc.gov/), and Scoping the Future of Oxford's Digital
Collections (http://www.bodley.ox.ac.uk/scoping/).
2.2: How should I start?
2.2.1: Project objectives
One of the first tasks to perform in document analysis is to define the goals of the project and
the context under which they are being developed. This could be seen as one of the more difficult tasks in the
document analysis procedure, as it relies less upon the physical analysis of the document and more upon the
theoretical positions taken with the project. This is the stage where you need to ask yourself why the
document is being encoded. Are you looking simply to preserve a digitized copy of the document in a format
that will allow an almost exact future replication? Is your goal to encode the document in a way that will assist
in a linguistic analysis of the work? Or perhaps there will be a combination of structural and thematic encoding,
so that users will be able to perform full-text searches of the document? Regardless of the choice made, the
project objectives must be carefully defined, as all subsequent decisions hinge upon them.
It is also important to take into consideration the external influences on the project. Often the
bodies that oversee digitization projects, either in a funding or advisory capacity, have specific conditions that
must be fulfilled. They might, for example, have markup requirements or standards (linguistic, TEI/SGML, or
EAD perhaps) that must be taken into account when establishing an encoding methodology. Also, if you are
creating the electronic text for scholarly purposes, then it is very likely that the standards of this community
will need to be adhered to. Again, it must be remembered that the electronic version of a text is a distinct
object and must be treated as such. Just as you would adhere to a publishing standard of practice with a
printed text, so must you follow the standard for electronic texts. The most stringent scholarly community,
the textual critics and bibliographers, will have specific, established guidelines that must be considered in
order to gain the requisite scholarly authority. Therefore, if you were creating a text to be used or approved
by this community their criteria would have to be integrated into the project standards, with the subsequent
influence on both the objectives and the creative process taken into account. If the digitization project
includes image formats, then there are specific archiving standards held by the electronic community that
might have to be met; this will not only influence the purchase of hardware and software, but will have an
impact on the way in which the electronic object will finally be structured. External conditions are easily
overlooked during the detailed analysis of the physical object, so be sure that the standards and policies that
influence the outcome of the project are given serious thought, as having to modify the documents
retrospectively can prove both detrimental and expensive.
This is also a good time to evaluate who the users of your project are likely to be. While you
might have personal goals to achieve with the project (perhaps a level of encoding that relates to your own area of expertise), many of the objectives will relate to your user base. Do you see the work being read by
secondary school pupils? Undergraduates? Academics? The general public? Be prepared for the fact that
every user will want something different from your text. While you cannot satisfy each desire, trying to
evaluate what information might be the most important to your audience will allow you to address the needs
and concerns you deem most appropriate and necessary. Also, if there are specific objectives that you wish
users to derive from the project then this too needs to be established at the outset. If the primary purpose
for the texts is as a teaching mechanism, then this will have a significant influence on how you choose to
encode the document. Conversely, if your texts are being digitized so that users will be able to perform
complex thematic searches, then both the markup of content and the content of the markup will differ
somewhat. Regardless of the decision, be sure that the outcome of this evaluation becomes integrated with
the previously determined project objectives.
You must also attempt to assess what tools users will have at their disposal to retrieve your
document. The hardware and software capabilities of your users will differ, sometimes dramatically, and will
most likely present some sort of restriction or limitation upon their ability to access your project. SGML
encoded text requires the use of specialised software, such as Panorama, to read the work. Even HTML has
tagsets that early browsers may not be able to read. It is essential that you take these variants into
consideration during the planning stage. There might be priorities in the project that require accessibility for
all users, which would affect the methodology of the project. However, don't let the user limitations stunt the
encoding goals for the document. Hardware and software are constantly being upgraded so that although some
of the encoding objectives might not be fully functional during the initial stages of the project, they stand a
good chance of becoming accessible in the near future.
2.2.2: Document context
The first stage of document analysis is not only necessary for detailing the goals and objectives
of the project, but also serves as an opportunity to examine the context of the document. This is a time to
gather as much information as possible about the documents being digitized. The amount gathered varies from
project to project, but in an ideal situation you will have a complete transmission and publication history for
the document. There are a few key reasons for this. Firstly, knowing how the object being encoded was
created will allow you to understand any textual variations or anomalies. This, in turn, will assist in making
informed encoding decisions at later points in the project. The difference between a printer error and an
authorial variation not only affects the content of the document, but also the way in which it is marked up.
Secondly, the depth of information gathered will give the document the authority desired by the scholarly
community. A text about which little is known can only be used with much hesitation. While some users might
find it more than acceptable for simply printing out or reading, there can be no authoritative scholarly analysis
performed on a text with no background history. Thirdly, a quality electronic text will have a TEI header
attached (see Chapter 6). The TEI header records all the information about the electronic text's print source.
The more information you know about the source, the fuller and more conclusive your header will be, which will
again provide scholarly authority. Lastly, understanding the history of the document will allow you to
understand its physicality.
The physicality of the text is an interesting issue and one on which very few scholars fully
agree. Clearly, an understanding of the physical object provides a sense of the format, necessary for a proper
structural encoding of the text, but it also augments a contextual understanding. Peter Shillingsburg theorises
that the 'electronic medium has extended the textual world; it has not overthrown books nor the discipline of
concentrated "lines" of thought; it has added dimensions and ease of mobility to our concepts of textuality'
(Shillingsburg 1996, 164). How is this so? Simply put, the electronic medium will allow you to explore the
relationships in and amongst your texts. While the physical object has trained readers to follow a more linear
narrative, the electronic document will provide you with an opportunity to develop the variant branches found
within the text. Depending upon the decided project objectives, you are free to highlight, augment or furnish
your users with as many different associations as you find significant in the text. Yet to do this, you must fully
understand the ontology of the texts and then be able to delineate this textuality through the encoding of the
computerised object.
It is important to remember that the transmission history does not end with the publication of
the printed document. Tracking the creation of the electronic text, including the revision history, is a
necessary element of the encoding process. The fluidity of electronic texts precludes the guarantee that
every version of the document will remain in existence, so the responsibility lies with the project creator to
ensure that all revisions and developments are noted. While some of the documentation might seem tedious, an
electronic transmission history will serve two primary purposes. One, it will help keep the project creator(s)
aware of what has developed in the creation of the electronic text. If there are quite a few staff members
working on the documents, you will be able to keep track of what has been accomplished with the texts and to
check that the project methodology is being followed. Two, users of the documents will be able to see what
emendations or regularisations have been made and to track what the various stages of the electronic object
were. Again, this will prove useful to a scholarly community, like the textual critics, whose research is
grounded in the idea of textual transmission and history.
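The TEI header discussed in Chapter 6 provides a ready-made place for this electronic transmission history in its revision description (<revisionDesc>) element. As a minimal sketch only (the element names follow TEI Lite, but the dates, names, and wording are invented for illustration), a record of two rounds of work might look like this:

    <revisionDesc>
      <change>
        <date>4 Nov 1999</date>
        <respStmt><name>K. Encoder</name><resp>encoder</resp></respStmt>
        <item>Regularised long 's' throughout; decisions recorded in the encoding description.</item>
      </change>
      <change>
        <date>21 Oct 1999</date>
        <respStmt><name>K. Encoder</name><resp>encoder</resp></respStmt>
        <item>First proofreading of the OCR output against the print source.</item>
      </change>
    </revisionDesc>

Keeping such a record as part of the electronic text itself means that both project staff and end users can see at a glance what has been done to the electronic object, by whom, and when.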
2.3: Visual and structural analysis
Once the project objectives and document context have been established, you can move on to an
analysis of the physical object. The first step is to provide the source texts with a classification. Defining the
document type is a critical part of the digitization process as it establishes the foundation for the initial
understanding of the text's structure. At this point you should have an idea of what documents are going to be
digitized for the project. Even if you are not sure precisely how many texts will be in the final project, it is
important to have a representative sample of the types of documents being digitized. Examine the sample
documents and decide what categories they fall under. The structure and content of a letter will differ
greatly from that of a novel or poem, so it is critical to make these naming classifications early in the process.
Not only are there structural differences between varying document types but also within the same type. One
novel might consist solely of prose, while another might be comprised of prose and images, while yet another
might have letters and poetry scattered throughout the prose narrative. Having an honest representative
sample will provide you with the structural information needed to make fundamental encoding decisions.
Deciding upon document type will give you an initial sense of the shape of the text. There are
basic structural assumptions that come with classification: looking for the stanzas in poetry or the paragraphs
in prose for example. Having established the document type, you can begin to assign the texts a more detailed
structure. Without worrying about the actual tag names, as this comes later in the process, label all of the
features you wish to encode. For example, if you are digitizing a novel, you might initially break it into large
structural units: title page, table of contents, preface, body, back matter, etc. Once this is done you might
move on to smaller features: titles, heads, paragraphs, catchwords, pagination, plates, annotations and so
forth. One way to keep the naming in perspective is to create a structure outline. This will allow you to see how
the structure of your document is developing, whether you have omitted any necessary features, or if you have
labelled too much.
Once the features to be encoded have been decided upon, the relationships between them can
then be examined. Establishing the hierarchical sequence of the document should not be too arduous a task,
especially if you have already developed a structural outline. It should at this point be apparent, if we stick
with the example of a novel, that the work is contained within front matter, body matter, and back matter.
Within front matter we find such things as epigraphs, prologues, and title pages. The body matter is comprised
of chapters, which are constructed with paragraphs. Within the paragraphs can be found quotations, figures,
and notes. This is an established and understandable hierarchy. There is also a sequential relationship where
one element logically follows another. Using the above representation, if every body has chapters, paragraphs,
and notes, then you would expect to find a sequence of <chapter> then <paragraph> then <note>, not <chapter>,
<note>, then <paragraph>. Again, the more you understand about the type of text you are encoding, the easier
this process will be. While the level of structural encoding will ultimately depend upon the project objectives,
this is an opportune time to explore the form of the text in as much detail as possible. Having these data will
influence later encoding decisions, and being able to refer to these results will be much easier than having to
sift through the physical object at a later point to resolve a structural dilemma.
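To make the hierarchy and sequence concrete, the novel described above could be sketched in nested markup along the following lines. The element names here are simply the generic labels used in this chapter, not the tags of any particular encoding scheme (actual tag names are settled later, as noted above):

    <novel>
      <front>
        <titlepage> ... </titlepage>
        <epigraph> ... </epigraph>
      </front>
      <body>
        <chapter>
          <paragraph> ... <quotation> ... </quotation> ... </paragraph>
          <note> ... </note>
        </chapter>
      </body>
      <back> ... </back>
    </novel>

Each element sits wholly inside its parent, and within a chapter the expected sequence of <paragraph> followed by <note> is preserved.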
The analysis also brings to light any issues or problems with the physical document. Are parts of
the source missing? Perhaps the text has been water damaged and certain lines are unreadable? If the
document is a manuscript or letter perhaps the writing is illegible? These are all instances that can be
explored at an early stage of the project. While these problems will add a level of complexity to the encoding
project, they must be dealt with in an honest fashion. If the words of a letter are illegible and you insert text
that represents your best guess at the actual wording, then this needs to be encoded. The beauty of document
analysis is that by examining the documents prior to digitization you stand a good chance of recognising these
issues and establishing an encoding methodology. The benefit of this is threefold: firstly, having identified and
dealt with this problem at the start you will have fewer issues arise during the digitization process; secondly,
there will be an added level of consistency during the encoding stage and retrospective revision won't be
necessary; thirdly, the project will benefit from the thorough level of accuracy desired and expected by the
scholarly community.
This is also a good time to examine the physical document and attempt to anticipate problems
with the digitization process. Fragile spines, flaking or foxed paper, badly inked text, all will create difficulties
during the scanning process and increase the likelihood of project delays if not anticipated at an early stage.
This is another situation that requires examining representative samples of texts. It could be that one text
was cared for in the immaculate conditions of a Special Collections facility while another was stored in a damp
corner of a bookshelf. You need to be prepared for as many document contingencies as possible. Problems not
only arise out of the condition of the physical object, but also out of such things as typography. OCR
digitization is heavily reliant upon the quality and type of fonts used in the text. As will be discussed in greater
detail in Chapter 3, OCR software is optimised for laser quality printed text. This means that the older the
printed text, the more degradation in the scanning results. These types of problems are critical to identify, as
decisions will have to be made about how to deal with them, decisions that will become a significant part of
the project methodology.
2.4: Typical textual features
The final stage of document analysis is deciding which features of the text to encode. Once
again, knowing the goals and objectives of the project will be of great use as you try to establish the breadth
of your element definition. You have control over how much of the document you want to encode, taking
into account how much time and manpower are dedicated to the project. Once you've made a decision about
the level of encoding that will go into the project, you need to make the practical decision of what to tag.
There are three basic categories to consider: structure, format and content.
In terms of structure there are quite a few typical elements that are encoded. This is a good
time to examine the structural outline to determine what skeletal features need to be marked up. In most
cases, the primary divisions of text (chapters, sections, stanzas, etc.) and the supplementary parts (paragraphs, lines, and pages) are all assigned tag names. With structural markup, it is helpful to know how
detailed an encoding methodology is being followed. As you will discover, you can encode almost anything in a
document, so it will be important to have established what level of markup is necessary and to then adhere to
those boundaries.
The second step is to analyse the format of the document. What appearance-based features
need to translate between the print and electronic objects? Some of the common elements relate to
attributes such as bold, italic and typeface. Then there are other aspects that take a bit more thought, such
as special characters. These require special codes, for example the entity reference &AElig; for Æ. However, cases do exist of
characters which cannot be encoded and alternate provisions must be made. Format issues also include notes
and annotations (items that figure heavily in scholarly texts), marginal glosses, and indentations. Elements of
format are easily forgotten, so be sure to go through the representative documents and choose the visual
aspects of the text that must be carried through to the electronic object.
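If a TEI-style scheme is eventually adopted (see Chapter 5), many of these appearance-based features can be carried across with descriptive tags and character entity references. The following lines are only an illustrative sketch, with invented content, rather than a prescription:

    <p>The 1697 printing of the <hi rend="italic">&AElig;neid</hi> was
       <hi rend="bold">not</hi> corrected by the author.</p>
    <note place="margin">A marginal gloss, keyed to the paragraph above.</note>

Here the rend and place attributes record how a feature appeared in the source, while the element names say what the feature is.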
The third encoding feature concerns document content. This is where you will go through the
document looking for features that are neither structural nor format based. This is the point where you can
highlight the content information necessary to the text and the user. Refer back to the decisions made about
textual relationships and what themes and ideas should be highlighted. If, for example, you are creating a
database of author biographies you might want to encode such features as author's name, place of birth,
written works, spouse, etc. Having a clear sense of the likely users of the project will make these decisions
easier and perhaps more straightforward. This is also a good time to evaluate what the methodology will be
for dealing with textual revisions, deletions, and additions, either authorial or editorial. Again, it is not so
critical here to define what element tags you are using but rather to arrive at a listing of features that need
to be encoded. Once these steps have been taken you are ready to move on to the digitization process.
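To make the content category concrete before moving on: the author-biography example above might eventually be realised as markup along the following lines, where the element and attribute names are purely hypothetical placeholders for whatever scheme the project settles on:

    <author>
      <name>Mary Shelley</name>
      <birthplace>London</birthplace>
      <work date="1818">Frankenstein</work>
      <spouse>Percy Bysshe Shelley</spouse>
    </author>

For recording authorial or editorial changes to the text itself, the TEI scheme discussed in Chapter 5 provides elements such as <add> and <del>; whichever scheme is chosen, the point at this stage is simply to list the features, not to fix the tags.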
Chapter 3: Digitization - Scanning, OCR, and Re-keying
3.1: What is digitization?
Digitization is quite simply the creation of a computerised representation of a printed analog.
There are many methods of digitizing and varied media to be digitized. However, as this guide is concerned
with the creation of electronic texts, it will focus primarily on text and images, as these are the main objects
in the digitization process. This chapter will address such issues as scanning and image capture, necessary
hardware and software concerns, and a more lengthy discussion of digitizing text.
For discussions of digitizing other formats, audio and video for example, there are many
thorough analyses of procedure. Peter Robinson's The Digitization of Primary Textual Sources covers most
aspects of the decision making process and gives detailed explanations of all formats. 'On-Line Tutorials and
Digital Archives' or 'Digitising Wilfred', written by Dr Stuart Lee and Paul Groves, is the final report of their
JTAP Virtual Seminars project and takes you step by step through the process and how the various
digitization decisions were made. They have also included many helpful worksheets to help scope and cost your
own project. For a more current study of the digitization endeavour, refer to Stuart Lee's Scoping the Future
of Oxford's Digital Collections at http://www.bodley.ox.ac.uk/scoping, which examined Oxford's current and
future digitization projects. Appendix E of the study provides recommendations applicable to those outside of
the Oxford community by detailing the fundamental issues encountered in digitization projects.
While the above reports are extremely useful in laying out the steps of the digitization process,
they suffer from the inescapable liability of being tied to the period in which they are written. In other
words, recommendations for digitizing are constantly changing. As hardware and software develop, so does the
quality of digitized output. The price cuts in storage costs allow smaller projects to take advantage of archival
imaging standards (discussed below). This in no way detracts from the importance of the studies produced by
scholars such as Lee, Groves, and Robinson; it simply acknowledges that the fluctuating state of digitization
must be taken into consideration when project planning. Keeping this in mind, the following sections will
attempt to cover the fundamental issues of digitization without focusing on ephemeral discussion points.
3.2: The digitization chain
The digitization chain is a concept expounded by Peter Robinson in his aforementioned
publication. The idea is based upon the fundamental concept that the best quality image will result from
digitizing the original object. If this is not an attainable goal, then digitization should be attempted with as
few steps removed from the original as possible. Therefore, the chain is composed of the number of
intermediates that come between the original object and the digital image: the more intermediates, the
more links in the chain (Robinson 1993).
This idea was then extended by Dr Lee so that the digitization chain became a circle in which
every step of the project became a separate link. Each link attains a level of importance so that if one piece of
the chain were to break, the entire project would fail (Groves and Lee 1999). While this is a useful concept in
project development, it takes us away from the object of this chapter (digitization), so we'll lean more
towards Robinson's concept of the digitization chain.
As will soon become apparent with the discussion of imaging hardware and software, having very
few links in the digitization chain will make the project flow more smoothly. Regardless of the technology
utilised by the project, the results will depend, first and foremost, on the quality of the image being scanned.
Scanning a copy of a microfilm of an illustration originally found in a journal is acceptable if it is the only option
you have, but clearly scanning the image straight from the journal itself is going to make an immeasurable
difference in quality. This is one important reason for carefully choosing the hardware and software. If you
know that you are dealing with fragile manuscripts that cannot handle the damaging light of a flatbed scanner,
or a book whose binding cannot open past a certain degree, then you will probably lean towards a digital camera.
If you have text that is from an 18th-century book, with fading pages and uneven type, you will want the best
text scanning software available. Knowing where your documents stand in the digitization chain will influence
the subsequent imaging decisions you will make for the project.
3.3: Scanning and image capture
The first step in digitization, both text and image, is to obtain a workable facsimile of the page.
To accomplish this you will need a combination of hardware and software imaging tools. This is a somewhat
difficult area to address in terms of recommending specific product brands, as what is considered industry (or
at least the text creation industry) standard is subject to change as technology develops. However, this
chapter will discuss some of the hardware and software frequently used by archives and digital project
creators.
3.3.1: Hardware - Types of scanner and digital cameras
There are quite a few methods of image capture that are used within the humanities community.
The equipment ranges from scanners (flatbed, sheetfed, drum, slide, microfilm) to high-end digital cameras. In
terms of standards within the digitizing community, the results are less than satisfactory. Projects tend to
choose the most available option, or the one that is affordable on limited grant funding. However, two of the
most common and accessible image capture solutions are flatbed scanners and high-resolution digital cameras.
Flatbed scanners
Flatbed scanners have become the most commonplace method for capturing images or text. Their
name comes from the fact that the scanner is literally a flat glass bed, quite similar to a copy machine, on
which the image is placed face down and covered. The scanner then passes light-sensitive sensors over the
illuminated page, breaking it into groups of pixel-sized boxes. It then represents each box with a zero or a
one, depending on whether the pixel is filled or empty. The importance of this becomes more apparent with the
discussion of image type below.
As a result of their falling costs and widespread availability, the use of quality flatbeds ranges
from the professional digital archiving projects to the living rooms of the home computer consumer. One
benefit of this increased use and availability is that flatbed scanning technology is evolving continually. This
has pushed the purchasing standards away from price and towards quality. In an attempt to promote the more
expensive product, the marketplace tends to hype resolution and bit-depth, two aspects of scanning that are
important to a project (see section 3.4) but are not the only concerns when purchasing hardware. While it is
not necessarily the case that you need to purchase the most expensive scanner to get the best quality digital
image, it is unlikely that the entry-level flatbeds (usually under 100 pounds/dollars) will provide the image
quality that you need. However, while it used to be the case that to truly digitize well you needed to purchase
the more high-end scanner, at a price prohibitive to most projects, the advancing digitizing needs of users
have pushed hardware developers to create mid-level scanners that reach the quality of the higher range.
As a consumer, you need to possess a holistic view of the scanner's capabilities. Not only should
the scanner provide you with the ability to create archival quality images (discussed in section 3.4.2) but it
should also make the digitization process easier. Many low-cost scanners do not have high-grade lenses, optics,
or light sources, thereby creating images that are of a very poor quality. The creation of superior calibre
images relates to the following hardware requirements (www.scanjet.hp.com/shopping/list.htm):
the quality of the lens, mirrors, and other optics hardware;
the mechanical stability of the optical system;
the focal range and stability of the optical system;
the quality of the scanning software and many other hardware and software features.
Also, many of the better quality scanners contain tools that allow you to automate some of the
procedures. This is extremely useful with such things as colour and contrast where, with the human eye, it is
difficult to achieve the exact specification necessary for a high-quality image. Scanning hardware has the
ability to provide this discernment for the user, so these intelligent automated features are a necessity to
decrease task time.
Digital cameras
One of the disadvantages of a flatbed scanner is that to capture the entire image the document
must lie completely flat on the scanning bed. With books this poses a problem because the only way to
accomplish this is to bend the spine to the breaking point. It becomes even worse when dealing with texts with
very fragile pages, as the inversion and pressure can cause the pages to flake away or rip. A solution to this
problem, one taken up by many digital archives and special collections departments, is to digitize with a stand-
alone digital camera.
Digital cameras are by far the most dependable means of capturing quality digital images. As
Robinson explains,
They can digitize direct from the original, unlike the film-based methods of microfilm scanning or Photo CD.
They can work with objects of any size or shape, under many different lights, unlike flatbed scanners. They
can make images of very high resolution, unlike video cameras (Robinson 1993, 39).
These benefits are most clearly seen in the digitization of manuscripts and early printed books,
objects that are difficult to capture on a flatbed because of their fragile composition. The ability to digitize
with variant lighting is a significant benefit as it won't damage the make-up of the work, a precaution which
cannot be guaranteed with flatbed scanners. The high resolution and heightened image quality allows for a level
of detail you would expect only in the original. As a result of these specifications, images can be delivered at
great size. A good example of this is the Early American Fiction project being developed at UVA's Electronic
Text Center and Special Collections Department (http://etext.lib.virginia.edu/eaf/intro.html).
The Early American Fiction project, whose goal is the digitization of 560 volumes of American
first editions held in the UVA Special Collections, is utilizing digital cameras mounted above light tables. They
are working with camera backs manufactured by Phase One attached to Tarsia Technical Industries Prisma 45
4x5 cameras on TTI Reprographic Workstations. This has allowed them to create high quality images without
damaging the physical objects. As they point out in their overview of the project, the workflow depends upon
the text being scanned, but the results work out to close to one image every three minutes. While this might
sound detrimental to the project timeline, it is relatively quick for an archival quality image. The images can be
seen at such a high-resolution that the faintest pencil annotations can be read on-screen. Referring back to
Robinson's digitization chain (3.2) we can see how this ability to scan directly from the source object prevents
the 'degradation' found in digitizing documents with multiple links between original and computer.
3.3.2: Software
Making specific recommendations for software programs is a problematic proposition. As has
been stated often in this chapter, there are no agreed 'standards' for digitization. With software, as with
hardware, the choices made vary from project to project depending upon personal choice, university
recommendations, and often budgetary restrictions. However, there are a few tools that are commonly seen in
use with many digitization projects. Regardless of the brand of software purchased, the project will need text
scanning software if there is to be in-house digitization of text and an image manipulation software package if
imaging is to be done. A wide variety of text scanning software packages is available, all with varying
capabilities. The intricacies of text scanning are discussed in greater detail below, but the primary
consideration with any text scanning software is how well it works with the condition of the text being
scanned. As this software is optimised for laser quality printouts, projects working with texts from earlier
centuries need to find a package that has the ability to work through more complicated fonts and degraded
page quality. While there is no standard, most projects work with Caere's OmniPage scanning software. In
terms of image manipulation, there are more choices depending upon what needs to be done. For image-by-
image manipulation, including converting TIFFs to web-deliverable JPEGs and GIFs, Adobe Photoshop is the
more common selection. However, when there is a move towards batch conversion, Equilibrium's DeBabelizer Pro is
known for its speed and high quality. If the conversion is being done in a UNIX environment, the XV image viewer is also a favourite amongst digitization projects.
3.4: Image capture and Optical Character Recognition (OCR)
As discussed earlier, electronic text creation primarily involves the digitization of text and
images. Apart from re-keying (which is discussed in 3.5), the best method of digitizing text is Optical
Character Recognition (OCR). This process is accomplished through the utilisation of scanning hardware in
conjunction with text scanning software. OCR takes a scanned image of a page and converts it into text.
Similarly, image capture also requires image scanning software to accompany the hardware. However, unlike
text scanning, image capture has more complex requirements in terms of project decisions and, like almost
everything else in the digitization project, benefits from clearly thought out objectives.
3.4.1: Imaging issues
The first decision that must be made regarding image capture concerns the purpose of the
images being created. Are the images simply for web delivery or are there preservation issues that must be
considered? The reason for this is simple: the higher the quality of image required, the higher the settings
necessary for scanning. Once this decision has been made, there are two essential image settings that must be established: what type of image will be scanned (greyscale? black and white? colour?) and at what resolution.
Image types
There are four main types of images: 1-bit black and white, 8-bit greyscale, 8-bit colour and 24-
bit colour. A bit is the fundamental unit of information read by the computer, with a single bit being
represented by either a '0' or a '1'. A '0' is considered an absence and a '1' is a presence, with more complex
representations of information being accommodated by multiple or gathered bits (Robinson 1993, 100).
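To put figures on the image types below: each additional bit doubles the number of values a single pixel can take, so 8 bits allow 2^8 = 256 shades or colours, while 24 bits allow 2^24 = 16,777,216 colours, the figure of 16.8 million quoted below.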
In a 1-bit black and white image, each pixel is represented by a single bit and can only be black or white. This is a rarely
used type and is completely unsuitable for almost all images. The only amenable image for this format would be
printed text or line graphics for which poor resulting quality did not matter. Another drawback of this type is
that saving it as a JPEG compressed image (one of the most prevalent image formats on the web) is not a
feasible option.
8-bit greyscale images are an improvement on 1-bit as they encompass 256 shades of grey. This format
is often used for non-colour images (see the Wilfred Owen Archive at http://www.hcu.ox.ac.uk/jtap/) and
provides a clear image rather than the resulting fuzz of a 1-bit scan. While greyscale images are often
considered more than adequate, there are times when non-colour images should be scanned at a higher colour depth, because the fine detail of the hand will come through more distinctly (Robinson 1993, 28). Also, the consistent
recommendation is that images that are to be considered preservation or archival copies should be scanned as
24-bit colour.
8-bit colour is similar to 8-bit greyscale with the exception that each pixel can be one of 256
colours. The decision to use 8-bit colour is completely project dependent, as the format is appropriate for web
page images but can come out somewhat grainy. Another factor is the type of computer the viewer is using, as
older ones cannot handle an image above 8-bit and will convert a 24-bit image to the lower format. However,
the factor to take into consideration here is primarily storage space. An 8-bit image, while not having the
quality of a higher format, will be markedly smaller.
If possible, 24-bit colour is the best scanning choice. This option provides the highest quality
image, with each pixel having the potential to be one of 16.8 million colours. The arguments against this
image format are the size, cost and time necessary. Again, knowing the objectives of the project will assist in
making this decision. If you are trying to create archival quality images, this is taken as the default setting.
24-bit colour makes the image look more photo-realistic, even if the original is greyscale. The thing to
remember with archival quality imaging is that if you need to go back and manipulate the image in any way, it
can be copied and adjusted. However, if you scan the image as a lesser format then any kind of retrospective
adjustments will be impossible. While a 24-bit colour archived image can be made greyscale, an 8-bit greyscale
image cannot be converted into millions of colours.
Resolution
The second concern relates to the resolution of the image. The resolution is determined by the
number of dots per inch (dpi). The more dots per inch in the file, the more information is being stored about
the image. Again, this choice is directly related to what is being done with the image. If the image is being
archived or will need to be enlarged, then the resolution will need to be relatively high. However, if the image is simply being placed on a web page, then the required resolution drops drastically. As with the choices in image
type, the dpi ranges alter the file size. The higher the dpi, the larger the file size. To illustrate the
differences, I will replicate an informative table created by the Electronic Text Center, which examines an
uncompressed 1" x 1" image in different types and resolutions.
Resolution (dpi)            400x400   300x300   200x200   100x100
1-bit black and white           20K       11K        5K        1K
8-bit greyscale or colour      158K       89K       39K        9K
24-bit colour                  475K      267K      118K       29K
Clearly the 400 dpi scan of a 24-bit colour image is going to be the largest file size, but is also
one of the best choices for archival imaging. The 100 dpi image is attractive not only for its small size, but
because screen resolution rarely exceeds this amount. Therefore, as stated earlier, the dpi choice depends on
the project objectives.
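The figures in the table can be approximated with a little arithmetic: an uncompressed image occupies roughly width (in inches) x dpi x height (in inches) x dpi x bit-depth / 8 bytes, plus a small amount of file-format overhead. The following fragment, written in Python, is a minimal sketch of that calculation; it is illustrative only and not part of any particular imaging package.

def uncompressed_bytes(width_in, height_in, dpi, bit_depth):
    # Approximate uncompressed size in bytes of a scanned image.
    pixels = (width_in * dpi) * (height_in * dpi)
    return pixels * bit_depth / 8

print(uncompressed_bytes(1, 1, 400, 24) / 1024)   # about 469K for 24-bit colour
print(uncompressed_bytes(1, 1, 400, 8) / 1024)    # about 156K for 8-bit greyscale
print(uncompressed_bytes(1, 1, 400, 1) / 1024)    # about 20K for 1-bit black and white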
File formats
If, when using an imaging software program, you click on the 'save as' function to finalise the
capture, you will see that there are quite a few image formats to choose from. In terms of text creation there
are three types fundamental to the process: TIFF, JPEG, and GIF. These are the most common image formats
because they transfer to almost any platform or software system.
TIFF (Tagged Image File Format) files are the most widely accepted format for archival image
creation and retention as master copies. More so than the following formats, TIFF files can be read by almost all platforms, which also makes them the best choice when transferring important images. Most digitization projects begin image scanning with the TIFF format, as it allows you to gather as much information as possible from the original and then save these data. This touches on the only disadvantage of the TIFF format: the size of the image. However, once the image is saved, it can be called up at any point and be read by a computer
with a completely different hardware and software system. Also, if there exists any possibility that the
images will be modified at some point in the future, then the images should be scanned as TIFFs.
JPEG (Joint Photographic Experts Group) files are the strongest format for web viewing and
transfer through systems that have space restrictions. JPEGs are popular with image creators not only for
their compression capabilities but also for their quality. While TIFF uses lossless compression, JPEG is a lossy compression format. This means that as the file is compressed, the image loses bits of information. However, this does not mean that the image will markedly decrease in quality. If the image is scanned at 24-bit, each dot can be one of 16.8 million colours, which is more than the human eye can actually differentiate on the screen. So with the compression of the file, the image loses the information least likely to be noticed by the eye. The disadvantage of this format is precisely what makes it so attractive: the lossy compression. Once an image is saved, the discarded information is lost. The implication of this is that the entire image, or certain parts of it, cannot be enlarged without visible degradation. Additionally, the more work done to the image, requiring it to be re-saved, the more information is lost. This is why JPEGs are not recommended for archiving: there is no way to
retain all of the information scanned from the source. Nevertheless, in terms of viewing capabilities and
storage size, JPEGs are the best method for online viewing.
GIF (Graphics Interchange Format) files are an older format that is limited to 256 colours. Like TIFFs, GIFs use lossless compression, but without requiring as much storage space. While they don't have the compression capabilities of a JPEG, they are strong candidates for graphic art and line drawings. They can also be made into transparent GIFs, meaning that the background of the image can be rendered invisible, thereby allowing it to blend in with the background of the web page. This is frequently used in web design but can have a beneficial use in text creation. There are instances, as mentioned in Chapter 2, where a text character cannot be encoded so that it can be read by a web browser. It could be an inline image (a head-piece, for example) or a character not defined by ISOLAT1 or ISOLAT2. When
the UVA Electronic Text Center created an online version of the journal Studies in Bibliography, there were
instances of inline special characters that simply could not be rendered through the available encoding. As the
journal is a searchable full-text database, providing a readable page image was not an option. Their solution to
this, one that did not disrupt the flow of the digitized text, was to create a transparent GIF of the image.
These GIFs were made so that they matched the size of the surrounding text and subsequently inserted quite
successfully into the digitized document.
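To make the approach concrete, a hypothetical fragment of an HTML page (the filename and character are invented for illustration) might read:

<p>... the chapter opens with a decorated initial <img src="initial-T.gif" alt="[decorated initial T]"> set into the text ...</p>

where initial-T.gif is a transparent GIF scaled to the height of the surrounding type, so that it sits in the line like an ordinary character.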
Referring back to the discussion on image types, the issue of file size tends to be one that comes
up quite often in digitization. It is the lucky project or archive that has an unlimited amount of storage space,
so most creators must contemplate how to achieve quality images that don't take up the 55MB of space needed by a 400 dpi, archival quality TIFF. However, it is easy to be led astray by the idea that the lower the bit-depth, the smaller the final files will be. Not so! Once again, the Electronic Text Center has produced a set of figures that illustrates how working with 24-bit images, rather than 8-bit, will produce a smaller JPEG along with a higher quality image file.
300 dpi 24-bit colour image: 2.65 x 3.14 inches:
uncompressed TIFF: 2188 K
'moderate loss' JPEG: 59 K
300 dpi 8-bit colour image: 2.65 x 3.14 inches:
uncompressed TIFF: 729 K
'moderate loss' JPEG: 76 K
100 dpi 24-bit colour image: 2.65 x 3.14 inches:
uncompressed TIFF: 249 K
'moderate loss' JPEG: 9 K
100 dpi 8-bit colour image: 2.65 x 3.14 inches:
uncompressed TIFF: 85 K
'moderate loss' JPEG: 12 K
(http://etext.lib.virginia.edu/helpsheets/scanimage.html)
While the sizes might not appear to be that markedly different, remember that these results
were calculated with an image that measures approximately 3x3 inches. Turn these images into page size,
calculate the number that can go into a project, and the storage space suddenly becomes much more of an
issue. Therefore, not only does 24-bit scanning provide a better image quality, but the compressed JPEG will
take less of the coveted project space.
So now that the three image formats have been covered, what should you use for your project?
In the best possible situation you will use a combination of all three. TIFFs would not be used for online
delivery, but if you want your images to have any future use, either for archiving, later enlarging, manipulation,
or printing, or simply as a master copy, then there is no other format in which to store the images. In terms of online presentation, JPEGs and GIFs are the best methods. JPEGs will be of a better calibre and smaller filesize but cannot be enlarged or they will pixelate. Yet in terms of viewing quality they will almost match the TIFF. How you use GIFs will depend on what types of images are associated with the project.
However, if you are making thumbnail images that link to a separate page which exhibits the JPEG version,
then GIFs are a popular choice for that task.
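As an illustration of this combined workflow, the following sketch uses the Python imaging library Pillow, a present-day tool that is not discussed elsewhere in this Guide, to derive a JPEG viewing copy and a GIF thumbnail from a hypothetical TIFF master; the filenames and settings are assumptions, not recommendations.

from PIL import Image   # the Pillow imaging library

master = Image.open("page001.tif")   # hypothetical 24-bit archival master, kept offline

# JPEG viewing copy: lossy, but small enough for online delivery.
master.convert("RGB").save("page001.jpg", "JPEG", quality=75)

# Small thumbnail for a contents page, saved as a GIF.
thumb = master.copy()
thumb.thumbnail((120, 160))          # resizes in place, preserving the aspect ratio
thumb.save("page001_thumb.gif")

# A greyscale derivative can always be made from the colour master;
# the reverse conversion cannot restore the discarded colour information.
master.convert("L").save("page001_grey.tif")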
In terms of archival digital image creation there seems to be some debate. As the Electronic
Text Center has pointed out, there is a growing dichotomy between preservation imaging and archival imaging.
Preservation imaging is defined as 'high-speed, 1-bit (simple black and white) page images shot at 600 dpi and
stored as Group 4 fax-compressed files' (http://etext.lib.virginia.edu/helpsheets/specscan.html). The results
of this are akin to microfilm imaging. While this does preserve the text for reading purposes, it ignores the
source as a physical object. Archiving often presupposes that the objects are being digitized so that the
source can be protected from constant handling, or so that it can be made internationally accessible. However, this
type of preservation annihilates any chance of presenting the object as an artefact. Archiving an object has an
entirely different set of requirements. Yet, having said this, there is also a prevalent school of thought in the
archiving community that the only imaging that can be considered of archival value is film imaging, which is
thought to last at least ten times as long as a digital image. Nonetheless, the idea of archival
imaging is still discussed amongst projects and funding bodies and cannot be overlooked.
There is no set standard for archiving, and you might find that different places and projects
recommend another model. However, the following type, format and resolution are recommended:
24-bit: There really is little reason to scan an archival image at anything less. Whether the source is
colour or greyscale, the images are more realistic and have a higher quality at this level. As the above
example shows, the filesize of the subsequently compressed image does not benefit from scanning at a
lower bit-depth.
600 dpi: This is, once again, a problematic recommendation. Many projects assert that scanning in at
300 or 400 dpi provides sufficient quality to be considered archival. However, many of the top
international digitization centres (Cornell, Oxford, Virginia) recommend 600 dpi as an archival
standard: it provides excellent detail of the image and allows for quite large JPEG images to be
produced. The only restrictive aspect is the filesize, but when thinking in terms of archival images you
need to try and get as much storage space as possible. Remember, the master copies do not have to be
held online, as offline storage on writeable CD-ROMs is another option.
TIFF: This should come as no surprise given the format discussion above. TIFF files, with their
complete retention of scanned information and cross-platform capabilities, are really the only choice
for archival imaging. The images maintain all of the information scanned from the source and are the
closest digital replication available. The size of the file, especially when scanned at 24-bit, 600 dpi,
will be quite large, but well worth the storage space. You won't be placing the TIFF image online, but it
is simple to make a JPEG image from the TIFF as a viewing copy.
This information is provided with the caveat that scanning technology is constantly changing for
the better. It is more than likely that in the future these standards will become passé, with higher standards
taking their place.
3.4.2: OCR issues
The goal of recognition technology is to re-create the text and, if desired, other elements of the
page including such things as tables and layout. Refer back to the concept of the scanner and how it takes a
copy of the image by replicating it with patterns of bits: the dots that are either filled or unfilled. OCR
technology examines the patterns of dots and turns them into characters. Depending upon the type of scanning
software you are using, the resulting text can be piped into many different word processing or spreadsheet
programs. Caere OmniPage released version 10.0 in the Autumn of 1999, which boasts the new Predictive
Optical Word Recognition Plus+ (POWR++) technology. As the OmniPage factsheet explains,
POWR++ enables OmniPage Pro to recognize standard typefaces, without training, from 4 to 72 point sizes.
POWR++ recognizes 13 languages (Brazilian Portuguese, British English, Danish, Dutch, Finnish, French, German,
Italian, Norwegian, Portuguese, Spanish, Swedish, and U.S English) and includes full dictionaries for each of
these languages. In addition, POWR++ identifies and recognizes multiple languages on the same page
(http://www.caere.com/products/omnipage/pro/factsheet.asp).
However, OCR software programs (including OmniPage) are very up-front about the fact that
their technology is optimised for laser printer quality text. The reasoning behind this should be readily
apparent. As scanning software attempts to examine every pixel in the object and then convert it into a filled
or empty space, a laser quality printout will be easy to read as it has very clear, distinct characters on a crisp white background, one that will not interfere with the clarity of the letters. However, once books
become the object type, the software capabilities begin to degrade. This is why the first thing you must
consider if you decide to use OCR for the text source is the condition of the document to be scanned. If the
characters in the text are not fully formed or there are instances of broken type or damaged plates, the
software will have a difficult time reading the material. The implication of this is that late 19th- and 20th-
century texts have a much better chance of being read well by the scanning software. As you move further
away from the present, with the differences in printing, the OCR becomes much less dependable. The changes in paper, moving from a bleached white to a yellowed, sometimes foxed, background create noise that the software must sift through. Then the font differences wreak havoc on the recognition capabilities. The gothic
and exotic type found in the hand-press period contrasts markedly with the computer-set texts of the late
20th century. It is critical that you anticipate type problems when dealing with texts that have such forms as
long esses, sloping descenders, and ligatures. Taking sample scans of the source materials will help pinpoint
some of these digitizing issues early on in the project.
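One way to run such a trial is sketched below, using the open-source Tesseract engine via the Python wrapper pytesseract rather than the commercial packages mentioned above; the filename is hypothetical and the snippet is only meant to show the shape of a quick sample test.

from PIL import Image
import pytesseract   # Python wrapper around the Tesseract OCR engine

sample = Image.open("sample_page.tif")                  # a trial scan of a representative page
text = pytesseract.image_to_string(sample, lang="eng")

# Inspect the first few hundred characters for broken type, long-s confusion,
# ligatures and other errors before committing to OCR for the whole project.
print(text[:500])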
While the ability to export text in different word processing formats is quite useful if you are scanning a document in order to print it, or to replace an accidentally deleted file, there are a few issues that should take priority for the text creator. Assuming you are using a software program such as
OmniPage, you should aim for a scan that retains some formatting but not a complete page element replication.
As will be explained in greater detail in Chapter 4, when text is saved with formatting that relates to a
specific program (Word, WordPerfect, even RTF) it is infused with a level of hidden markup, a markup that
explains to the software program what the layout of the page should look like. In terms of text creation, and
the long-term preservation of the digital object, you want to be able to control this markup. If possible,
scanning at a setting that will retain font and paragraph format is the best option. This will allow you to see
the basic format of the text; I'll explain the reason for this in a moment. If you don't scan with this setting
and opt for the choice that eliminates all formatting, the result will be text that includes nothing more than
word spacing: there will be no accurate line breaks, no paragraph breaks, no page breaks, no font
differentiation, etc. Scanning at a mid-level of formatting will assist you if you have decided to use your own
encoding. As you proofread the text you will be able to add the structural markup chosen for the project.
Once this has been completed the text can be saved out in a text-only format. Therefore, not only will you
have the digitized text saved in a way that will eliminate program-added markup, but you will also have a basic
level of user-dictated encoding.
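For instance, a proofread passage might move from raw OCR output to something like the following, where the tags are simply whatever structural scheme the project has settled on (the names here are invented for illustration):

<chapter n="1">
<heading>CHAPTER I.</heading>
<paragraph>The opening paragraph of the chapter, exactly as proofread against the source.</paragraph>
</chapter>

Saved as plain text, such a file carries only the project's own encoding and none of the hidden formatting codes added by a word processor.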
3.5: Re-keying
Unfortunately for the text creator, there are still many situations where the documents or
project preclude the use of OCR. If the text is of a poor or degraded quality, then it is quite possible that the
time spent correcting the OCR mistakes will exceed that of simply typing in the text from scratch. The amount
of information to be digitized also becomes an issue. Even if the document is of a relatively good quality, there
might not be enough time to sit down with 560 volumes of texts (as with the Early American Fiction project)
and process them through OCR. The general rule of thumb, and this varies from study to study, is that a best-
case scenario would be three pages scanned per minute, and this doesn't take into consideration the process of putting the document on the scanner, flipping pages, or the subsequent proofreading. If, when addressing
these concerns, OCR is found incapable of handling the project digitization, the viable solution is re-keying the
text.
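The scale of the problem is easy to estimate. The sketch below applies the three-pages-per-minute rule of thumb to the 560-volume collection mentioned above; the figure of 300 pages per volume is an assumption made purely for the sake of the example.

volumes = 560
pages_per_volume = 300            # assumed average, for illustration only
pages_per_minute = 3              # best-case scanning rate quoted above

scanning_hours = volumes * pages_per_volume / pages_per_minute / 60
print(round(scanning_hours))      # roughly 930 hours of scanner time alone,
                                  # before page-turning, OCR correction or proofreading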
Once you've made this decision, the next question to address is whether to handle the document
in-house or out-source the work. Deciding to digitize the material in-house relies on having all the necessary
hardware, software, staff, and time. There are a few issues that come into play with in-house digitization. The
primary concern is the speed of re-keying. Most often the re-keying is done by the research assistants
working on the project, or graduate students from the text creator's local department. The problem here is
that paying an hourly rate to someone re-keying the text often proves more expensive than out-sourcing the
material. Also, there is the concern that a single person typing in material tends to overlook keyboarding
errors, and, if the staff member is familiar with the source material, there is a tendency to correct
automatically those things that seem incorrect. So while in-house digitization is an option, these concerns
should be addressed from the outset.
The most popular choice with many digitization projects (Studies in Bibliography, The Early
American Fiction Project, Historical Collections for the National Digital Library and the Chadwyck-Healey
databases to name just a few) is to out-source the material to a professional keyboarding company. The
fundamental benefit most often cited is the almost 100% accuracy rate of the companies. One such company,
Apex Data Services, Inc. (used by the University of Virginia Electronic Text Center), promises a conversion
accuracy of 99.995%, along with 100% structural accuracy, and reliable delivery schedules. Their ADEPT
software allows the dual keyboarders to see a real-time comparison, permitting a single-entry verification
cycle (http://www.apexinc.com/dcs/dcs_index.html). Also, by employing keyboarders who do not possess a
subject speciality in the text being digitized (many, for that matter, often do not speak the language being converted), they avoid the problem of keyboarders subconsciously modifying the text. Keyboarding companies
are also able to introduce a base-level encoding scheme, established by the project creator, into the
documents, thereby eliminating some of the more rudimentary tagging tasks.
Again, as with most steps in the text creation process, the answers to these questions will be
project dependent. The decisions made for a project that plans to digitize a collection of works will be
markedly different from those made by an academic who is creating an electronic edition. It reflects back, as
always, to the importance of the document analysis stage. You must recognise what the requirements of the
project will be, and what external influences (especially staff size, equipment availability, and project funding)
will affect the decision-making process.
Chapter 4: Markup: The key to reusability
4.1: What is markup?
Markup is most commonly defined as a form of text added to a document to transmit information
about both the physical and electronic source. Do not be surprised if the term sounds familiar; it has been in
use for centuries. It was first used within the printing trade as a reference to the instructions inscribed onto
copy so that the compositor would know how to prepare the typographical design of the document. As Philip
Gaskell points out, 'Many examples of printers' copy have survived from the hand-press period, some of them
annotated with instructions concerning layout, italicization, capitalization, etc.' (Gaskell 1995, 41). This concept
has evolved slightly through the years but has remained entwined with the printing industry. G.T. Tanselle
writes in a 1981 article on scholarly editing, 'one might...choose a particular text to mark up to reflect these
editorial decisions, but that text would only be serving as a convenient basis for producing printer's copy...'
(Tanselle 1981, 64). There still seems to be some demarcation between the usage of the term for bibliography
and for computing, but the boundary is really quite blurred. The leap from markup as a method of labelling
instructions on printer's copy to markup as a language used to describe information in an electronic document
is not so vast.
Therefore when we think of markup there are really three differing types (two of which will be
discussed below). The first is the markup that relates strictly to formatting instructions found on the physical
text, as mentioned above. It is used for the creation of an emended version of the document and, with the
exception of the work of textual scholars, is rarely referred to again. Then there is the proprietary markup
found in electronic document encoding, which is tied to a specific piece of software or developer. This markup
is concerned primarily with document formatting, describing what words should be in italics or centred, where
the margins should be set, or where to place a bulleted list. There are a few things to note about this type of
markup. The first is that being proprietary means that it is intimately tied to the software that created it.
This does not pose a problem as long as the document will only ever remain within that software program, and as long as the creator recognises that there is no guarantee that the software will exist in the future. This is
important, as proprietary software formats allow users to say where and how they want the document
formatted, but then the software inserts its own markup language to accomplish this. When users create
documents in Word or a PDF file, they are unconsciously adding encoding with every keystroke. As anyone who
has created a document in one software format and attempted to transfer it to another is aware, the encoding
does not transfer, and if for some reason a bit of it does, it rarely means the same thing.
The third type of markup is non-proprietary, a generalised markup language. There are two
critical distinctions between this markup and the previous two. Firstly, as it is a general language and not tied
to specific software or hardware, it offers cross-platform capabilities. This ensures that documents utilising
this style of encoding will be readable many years down the line. Secondly, while a generalised markup language,
as with the others, allows users to insert formatting markup in the document, it also allows for encoding based
upon the content of the work. This is a level of control not found in the previous styles of markup. Here the
user is able not only to describe the appearance of the document but the meanings found within it. This is a
critical aspect of electronic text creation, and therefore receives more in-depth treatment below.
4.2: Visual/presentational markup vs. structural/descriptive markup
The discussion of visual/presentational markup vs. structural/descriptive markup carries on from
the concepts of proprietary and non-proprietary markup. As the name implies, presentational markup is
concerned with the visual structure of a text. Depending upon what processing software is being used, the
markup explains to the computer how the document should appear. So if the work should be seen in 12-point Tahoma, the software dictates a markup so that this happens. Presentational markup is concerned with
structure only insofar as it relates to the visual aspect of the document. It does not care whether a heading is
for a book, a chapter or a paragraph; the only consideration is how that heading should look on the page.
Most proprietary language formats tend to focus solely on presentational issues. To move into descriptive
markup would require that the software provide the document creator with the ability to formulate their own
tags with which to encode the structure and presentation of the work.
In other words, descriptive markup relates less to the visual strategy of the work and more to
the reasons behind the structure. It allows the creator to encode the document with a markup that more
clearly shows how the presentation, configuration, and content relate to the document as a whole. Once again,
the beneficial effects of thorough document analysis can be seen. Having a holistic sense of the document,
having the detailed listing of critical elements in the document, will demonstrate how descriptive markup advances
a project. In this case, a non-proprietary language will be the most beneficial, as it will allow the document
creators to arrive at their own tagsets, providing a much needed level of control over the encoding
development.
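The difference is easiest to see in a small example (the tag names below are purely illustrative):

Presentational: <ITALIC>On the Origin of Species</ITALIC> was Darwin's best-known work.
Descriptive:    <TITLE>On the Origin of Species</TITLE> was Darwin's best-known work.

Both versions may well be rendered identically on screen, but only the second records why the phrase is set apart, and so only the second can later be searched, indexed, or restyled as a book title.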
4.2.1: PostScript and Portable Document Format (PDF)
In 1985, Adobe Systems created a programming language for printers called PostScript. In so
doing, they produced a system that allowed computers to 'talk' to their printers. This language describes for
the printer the appearance of the page, incorporating elements like text, graphics, colour, and images, so that
documents maintain their integrity through the transmission from computer to printer. PostScript printers
have become industry standard with corporations, marketers, publishing companies, graphic designers, and
more. Printers, slide recorders, imagesetters: all these output devices utilise PostScript technology. Combine
this with PostScript's multiple operating system capability and it becomes clear why PostScript has become
the standard for printing technology (http://www.adobe.com/print/features/psvspdf/main.html). The PostScript language can be found in most printers (Epson, IBM, and Hewlett-Packard, just to name a few manufacturers), almost
guaranteeing that a high standard of printing can be found in both the home and office. Adobe provides a list
of compatible products at http://www.adobe.com/print/postscript/oemlist.html.
Portable Document Format (PDF) was created by Adobe in 1993 to complement their PostScript
language. PDF allows the user to view a document with a presentational integrity that almost resembles a
scanned image of the source. This delivery of visually rich content is the most attractive use of PDF. The
format is entirely concerned with keeping the document intact, and, to ensure this, allows any combination of
text, graphics and images. It also has full, rich colour presentation and is therefore often used with corporate
and marketing graphic arts materials. Another enticing feature, depending on the quality of the printer, is that
when a PDF file is printed out, the hard copy output is an exact replication of the screen image. PDF is also
desirable for its delivery strengths. Not only does the document maintain its visual integrity, but it also can be
compressed. This compression eases on-line and CD-ROM transmission and assists its archiving opportunities.
PDF files can be read through an Acrobat Reader application that is freely available for download
via the web. This application is also capable of serving as a browser plug-in for online document viewing.
Creating PDF files is a bit more complicated than the viewing procedure. To write a PDF document it is
necessary to purchase Adobe software. PDFWriter allows the user to create the PDF document, and the more
expensive Adobe Capture program will convert TIFF files into PDF formatted text versions. If the user would
like the document to become more interactive, offering the ability to annotate the document for example,
then this functionality can be added with the additional purchase of Acrobat Exchange, which serves an
editorial function. Exchange allows the user to annotate and edit the document, search across documents and
also has plug-ins that provide highlighting ability.
Taking into consideration the earlier discussion of visual vs. structural markup, it is clear how
formats like PostScript and PDF fall into the category of proprietary processing languages concerned with
presentational rather than descriptive markup. This does not imply that these languages should be avoided. On
the contrary, if the only concern is how the document appears both on the screen and through the printer,
then software of this nature is appropriate. However, if the document needs to cross platforms or the project
objectives require control over the encoding or document preservation, then these proprietary programs are
not dependable.
4.2.2: HTML 4.0
HyperText Markup Language (or HTML as it is commonly known) is a non-proprietary format
markup system used for publishing hypertext on the World Wide Web. To date, it has appeared in four main
versions (1.0, 2.0, 3.2, 4.0), with the World Wide Web Consortium (W3C) recommending 4.0 as the markup
language of choice. HTML is a derivative of SGML, the Standard Generalised Markup Language. SGML will be
discussed in greater detail in Chapter 5, but suffice it to say that it is an international standard metalanguage
that defines a set of rules for device-independent, system-independent methods of encoding electronic texts.
SGML allows you to create your own markup language but provides the rules necessary to ensure its processing
and preservation. HTML is a successful implementation of the SGML concepts, and, as a result, is accessible to
most browsers and platforms. Along with this, it is a relatively simple markup language to learn, as it has a
limited tagset. HTML is by far the most popular web-publishing language, allowing users to create online text
documents that include multimedia elements (such as images, sounds, and video clips), and then put these
documents in an environment that allows for instant publication and retrieval.
There are many advantages to a markup language like HTML. As mentioned above, the primary
benefit is that a document encoded with HTML can be viewed in almost any browser, an extremely attractive
option for a creator who wants documents which can be viewed by an audience with varied systems. However, it
is important to note that while the encoding can cross platforms, there are consistently differences in page
appearance between browsers. While W3C recommends the usage of HTML 4.0, many of its features are
simply not available to users with early versions of browsers. Unlike PDF, which is extremely concerned with
keeping the document and its format intact, HTML has no true sense of page structure and files can neither
be saved nor printed with any sense of precision.
Besides the benefit of a markup language that crosses platforms with ease, HTML attracts its
many users for the simple manner with which it can be mastered. For users who do not want to take the time
to learn the tagset, the good news is that conversion-to-HTML tools are becoming more accessible and easier to use. Those who cannot even spare the time to learn how to use HTML-creation software (of which there is a limited quantity) can sit down with any text creation program (Notepad, for example) and author an HTML document. Then, by using the 'Open File' tool in a browser, the document can immediately be viewed.
What this means for novice HTML authors is that they can sit down with a text creator and a browser and
teach themselves a markup language in one session. And as David Seaman, Director of the Electronic Text
Center at the University of Virginia, points out:
[this] has a real pedagogical value as a form of SGML that makes clear to newcomers the concept of
standardized markup. To the novice, the mass of information that constitutes the Text Encoding Initiative
Guidelines, the premier tagging scheme for most humanities documents, is not easily grasped. In contrast,
the concise guidelines to HTML that are available on-line (and usually as a "help" option from the menu of a
Web client) are a good introduction to some of the basic SGML concepts. (Seaman 1994).
This is of real value to the user. The notion of marking up a text is quite often an overwhelming
concept. Most people do not realise that markup enters into their life every time they make a keystroke in a
word processing program. So for the uninitiated, HTML provides a manageable stepping-stone into the world of
more complex encoding. Once this limited tagset is mastered, many users find the jump into an extended
markup language less intimidating and more liberating.
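By way of illustration, the following is a complete, if minimal, HTML page of the kind a newcomer might type into Notepad and open in a browser (the filename in the link is invented):

<HTML>
<HEAD>
<TITLE>A first HTML page</TITLE>
</HEAD>
<BODY>
<H1>Reading list</H1>
<P>This term's set texts are listed on a <A HREF="texts.html">separate page</A>.</P>
</BODY>
</HTML>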
However, one of the drawbacks to this easy authoring language is that many of the online
documents are created without a DTD. 'DTD' is an abbreviation for 'document type definition', which
outlines the formal specifications for an SGML encoded document. Basically, a DTD is the method for spelling
out the SGML rules that the document is following. It sets the standards for what markup can be used and
how this markup interacts with others. So, for example, if you create an HTML document with a specific
software program, say HoTMetaL PRO, the resulting text will begin with a document type declaration stating
which DTD is being used. A sample declaration from a HoTMetaL creation looks like this:
<!DOCTYPE HTML PUBLIC "-//SoftQuad//DTD HoTMetaL PRO 4.0::19970714::extensions to HTML 4.0//EN"
"hmpro4.dtd">
As can be seen in the above statement, the declaration explains that the document will follow the HoTMetaL
PRO 4.0 DTD. In so doing, the markup language used must adhere to the rules set out in this specific DTD. If
it does not, then the document cannot be successfully validated and may not work as intended.
As it stands now, web browsers require neither a DTD nor a document type declaration. Browsers
are notoriously lax in their HTML requirements, and unless something serious is missing from the encoded
document it will be successfully viewed through a Web client. The impact of this is that while HTML provides a
convenient and universal markup language for a user, many of the documents floating out in cyberspace are
permeated with invalid code. The focus then moves away from authoring documents that conform to a set of
encoding guidelines and towards the creation of works that can be viewed in a browser (Seaman 1994). This
problem will become more severe with the increased use of Extensible Markup Language, or XML as it is more
commonly known. This markup language, which is being lauded as the new lingua franca, combines the visual
benefits of HTML with the contextual benefits of SGML/TEI. However, while XML will have the universality
of HTML, the web clients will require a more stringent adherence to markup rules. While documents that
comply with the rules of an HTML DTD will find the transition relatively simple, the documents that were
constructed strictly with viewing in mind will require a good deal of clean-up prior to conversion.
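A simple example shows the difference in strictness. Most browsers will happily display

<P>First paragraph
<P>Second paragraph

even though neither element is closed, whereas an XML processor requires every element to be explicitly closed and consistently cased (here lowercase):

<p>First paragraph</p>
<p>Second paragraph</p>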
This is not to say that HTML is not a useful tool for creating online documents. As in the case of
PostScript and PDF, the choice to use HTML should be document dependent. It is the perfect choice for
static documents that will have a short shelf-life. If you are creating course pages or supplementary materials
regarding specific readings that will not be necessary or available after the end of term, then HTML is an
appropriate choice. If, however, you are concerned about presentational and structural integrity, the markup
of document content and/or the long-term preservation of the text, then a user-definable markup language is
a much better choice.
4.2.3: User-definable descriptive markup
A user-definable descriptive markup is exactly what its name implies. The content of the markup
tags is established solely by the user, not by the software. As a result of SGML and its concept of a DTD, a
document can have any kind of markup a creator desires. This frees the document from being married to
proprietary hardware or software and from its reliance upon an appearance-based markup language. If you
decide to encode the document with a non-proprietary language, which we highly recommend, then this is a
good time to evaluate the project goals. While a user-definable markup language gives you control over the
content of the markup, and thereby more control over the document, the markup can only be fully understood
by you. Although not tied to a proprietary system, it is also not tied to any accepted standard. A markup
language defined and implemented by you is simply that: a personal, non-proprietary markup system.
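For example, nothing prevents you from devising a scheme such as the following (the tags are invented for this illustration):

<letter>
<salutation>Dear Sir,</salutation>
<body>The text of the letter itself ...</body>
<signature>Your obedient servant</signature>
</letter>

Such markup may serve the project perfectly well, but its meaning exists only in your own documentation; no other project or piece of software will recognise it without explanation.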
However, if the electronic texts require a language that is non-proprietary, more extensive and
content-oriented than HTML, and comprehensible and acceptable to a humanities audience, then there is a
solution: the Text Encoding Initiative (TEI). TEI is an international implementation of SGML, providing a non-
proprietary markup language that has become the de facto standard in Humanities Computing. TEI, which is
explained more fully in Chapter 5, provides 'a full set of tags, a methodology, and a set of Document Type
Descriptions (DTDs) that allow the detailed (or not so detailed) description of the spatial, intellectual,
structural, and typographic form of a work' (Seaman 1994).
4.3: Implications for long-term preservation and reuse
Markup is a critical, and inescapable, part of text creation and processing. Regardless of the
method chosen to encode the document, some form of markup will be included in the text. Whether this
markup is proprietary or non-proprietary, appearance- or content-based is up to you. Be sure to evaluate the
project goals when making the encoding decisions. If the project is short-lived or necessarily software
dependent, then the choices are relatively straightforward. However, if you are at all concerned about long-
term preservation, cross-platform capabilities, and/or descriptive markup, then a user-definable (preferably
TEI) markup language is the best choice. As Peter Shillingsburg corroborates:
...the editor with a universal encoding system developing an electronic edition with a multiplatform application
has created a tool available to anyone with a computer and has ensured the longevity of the editorial work
through generations to come of software and hardware. It seems worth the effort (Shillingsburg 1996, 163).
Chapter 5: SGML/XML and TEI
The previous chapter showed what markup is, and how it plays a crucial role in almost every
aspect of information processing. Now we shall learn about some crucial applications of descriptive markup
which are ideally suited to the types of texts studied by those working in the arts and humanities disciplines.
5.1: The Standard Generalized Markup Language (SGML)
The late 1970s and early 1980s saw a consensus emerging that descriptive markup languages had
numerous advantages over other types of text encoding. A number of products and macro languages appeared
which were built around their own descriptive markup languages, and whilst these represented a step
forward, they were also constrained by the fact that users were required to learn a new markup language each
time, and could only describe those textual features which the markup scheme allowed (sometimes extensions
were possible, but implementing them was rarely a straightforward process).
The International Standards Organisation (ISO) also recognised the value of descriptive markup
schemes, and in 1986 an ISO committee released a new standard called ISO 8879, the Standard Generalized
Markup Language (SGML). This complex document represented several years' effort by an international
committee of experts, working together under the Chairmanship of Dr Charles Goldfarb (one of the creators
of IBM's descriptive markup language, GML). Since SGML was a product of the International Standards
process, the committee also had the benefit of input from experts from the numerous national standards
bodies associated with the ISO, such as the UK's British Standards Institute (BSI).
5.1.1: SGML as metalanguage
A great deal of largely unjustified mystique surrounds SGML. You do not have to look very hard
to find instances of SGML being described as 'difficult to learn', 'complex to implement', or 'expensive to
use', when in fact it is none of these things. People all too frequently confuse the acronym, SGML, with SGML
applications, many of which are indeed highly sophisticated and complex operations, designed to meet the
rigorous demands of blue chip companies working in major international industries (automotive, pharmaceutical,
or aerospace engineering). It should not be particularly surprising that a documentation system designed to
control and support every aspect of the tens of thousands of pages of documentation needed to build and
maintain a battleship, fix the latest passenger aircraft, or supplement a legal application for international
recognition for a new advanced drug treatment, should appear overwhelmingly complex to an outsider. In fact,
despite its name, SGML is not even a markup language. Instead, it would be more appropriate to call SGML a
'metalanguage'.
In a conventional markup language, such as HTML, users are offered a pre-defined set of markup
tags from which they must make appropriate selections; if they suddenly introduce new tags which are not
part of the HTML specification, then it is clear that the resulting document will not be considered valid HTML,
and it may be rejected or incorrectly processed by HTML software (e.g. an HTML-compatible browser). SGML,
on the other hand, does not offer a pre-defined set of markup tags. Rather, it offers a grammar and specific
vocabulary which can be used to define other markup languages (hence 'metalanguage').
SGML is not constrained to any one particular type of application, and it is neither more nor less
suited to producing technical documentation and specifications in the semiconductor industry, than it is for
marking up linguistic features of ancient inscribed tablets of stone. In fact, SGML can be used to create a
markup language to do pretty well anything, and that is both its greatest strength and weakness. SGML cannot
be used 'out-of-the-box', so to speak, and because of this it has earned an undeserved reputation in some
quarters as being troublesome and slow to implement. On the other hand, there are many SGML applications
(and later we shall learn about one in particular), which can be used straightaway, as they offer a fully
documented markup language which can be recognised by any one of a suite of tools and implemented with a
minimum of fuss. SGML provides a mechanism for like-minded people with a shared concern to get together
and define a common markup language which satisfies their needs and desires, rather than being limited by the
vision of the designers of a closed, possibly proprietary markup scheme which only does half the job.
SGML offers another advantage in that it not only allows (groups of) users to define their own
markup languages, it also provides a mechanism for ensuring that the rules of any particular markup language
can be rigorously enforced by SGML-aware software. For example, within HTML, although there are six
different levels of heading defined (e.g. the tags <H1> to <H6>), there is no requirement that they should be
applied in a strictly hierarchical fashion; in other words, it is perfectly possible for a series of headings in an
HTML document to be marked up as <H1>, then <H3>, followed by <H5>, followed in turn by <H2>, <H4>, and <H6>, all to achieve a particular visual appearance in a particular HTML browser. By contrast, should such a
feature be deemed important, an SGML-based markup language could be written in such a way that suitable
software can ensure that levels of heading nest in a strictly hierarchical fashion (and the strength of this
approach can perhaps become even more evident when encoding other kinds of hierarchical structure, e.g. a
<BOOK> must contain one or more <CHAPTER>s, each of which must in turn contain one or more <PARAGRAPH>s,
and so on). We shall learn more about this in the following section.
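By way of a preview, the hierarchy just described could be expressed in a DTD with element declarations of roughly the following form (a sketch only: the two hyphens state that neither the start- nor the end-tag may be omitted, and the element names are illustrative):

<!ELEMENT BOOK      - - (CHAPTER+)>
<!ELEMENT CHAPTER   - - (HEADING, PARAGRAPH+)>
<!ELEMENT HEADING   - - (#PCDATA)>
<!ELEMENT PARAGRAPH - - (#PCDATA)>

A validating parser would then report an error if, say, a <PARAGRAPH> appeared directly inside a <BOOK> with no intervening <CHAPTER>.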
There is one final, crucial, difference between SGML-based markup languages and other
descriptive markup languages: the process by which International Standards are created, maintained, and
updated. ISO Standards are subject to periodic formal review, and each time this work is undertaken it
happens in full consultation with the various national standards bodies. The Committee which produced SGML
has guaranteed that if and when any changes are introduced to the SGML standard, this will be done in such a
way as to ensure backwards compatibility. This is not a decision which has been undertaken lightly, and the full
implications can be inferred from the fact that commercial enterprises rarely make such an explicit
commitment (and even when they do, users ought to reflect upon the likelihood that such a commitment will
actually be fulfilled given the considerable pressures of a highly competitive marketplace). The essential
difference has been characterised thus: the creators of SGML believe that a user's data should belong to
that user, and not be tied up inextricably in a proprietary markup system over which that user has no control;
whereas, the creators of a proprietary markup scheme can reasonably be expected to have little motivation to
ensure that data encoded using their scheme can be easily migrated to, or processed by, a competitor's
software products.
5.1.2: The SGML Document
The SGML standard gives a very rigid definition as to what constitutes an SGML document.
Whilst there is no need for us to consider this definition in detail at this stage, it is worthwhile reviewing the
major concepts as they offer a valuable insight into some crucial aspects of an electronic text. Perhaps first
and foremost amongst these is the notion that an SGML document is a single logical entity, even though in
practice that document may be composed of any number of physical data files, spread over a storage medium
(e.g. a single computer's hard-disk) or even over different types of storage media connected together via a
network. As today's electronic publications become more and more complex, mixing (multilingual) text with
images, audio, and other media, this complexity reinforces the need to ensure that they are created in line with accepted
standards. For example, an article from an electronic journal mounted on a website may be delivered to the
end-user in the form of a single HTML document, but that article (and indeed the whole journal), may rely upon
dozens or hundreds of data files, a database to manage the entire collection of files, several bespoke scripts
to handle the interfacing between the web and the database, and so on. Therefore, whenever we talk about an
electronic document, it is vitally important to remember that this single logical entity may, in fact, consist of
many separate data files.
SGML operates on the basis of there being three major parts which combine to form a single
SGML document. Firstly, there is the SGML declaration, which specifies any system and software constraints.
Secondly, there is the prolog, which defines the document structure. Lastly, there is the document instance,
which contains what one would ordinarily think of as the document. Whilst this may perhaps appear
unnecessarily complicated, in fact it provides an extremely valuable insight into the key components which are
essential to the creation of an electronic document.
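In outline, and reusing the illustrative element names from the sketch in the previous section, the three parts fit together something like this (the SGML declaration is normally a default supplied with your software rather than something you type, and the DTD is assumed here to live in a file called book.dtd):

<!DOCTYPE BOOK SYSTEM "book.dtd">
<BOOK>
<CHAPTER>
<HEADING>Chapter One</HEADING>
<PARAGRAPH>The text of the first paragraph.</PARAGRAPH>
</CHAPTER>
</BOOK>

Everything from <BOOK> onwards is the document instance; the <!DOCTYPE ...> line belongs to the prolog.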
The SGML declaration tells any software that is going to process an SGML document all that it
should need to know. For example, the SGML declaration specifies which character sets have been used in the
document (normally ASCII or ISO 646, but more recently this could be Unicode, or ISO 10646). It also
establishes any constraints on system variables (e.g. the length of markup tag names, or the depth to which
tags can be nested), and states whether or not any of SGML's optional features have been used. The SGML
standard offers a default set-up, so that, for example, the characters < and > are used to delimit markup tag
names (and with the widespread acceptance of HTML, this has become the accepted way to indicate markup tags), but if for any reason this presented a problem for a particular application (e.g. encoding a lot of data in
which < and > were heavily used to indicate something else), it would be possible to redefine the delimiters as
@ or #, or whatever characters were deemed to be more appropriate.
The SGML declaration is important for a number of reasons. Although it may seem an unduly
complicated approach, it is often these fundamental system or application dependencies which make it so
difficult to move data around between different software and hardware environments. If the developers of
wordprocessing packages had started off by agreeing on a single set of internal markup codes they would all
use to indicate a change in font, the centring of a line of text, the occurrence of a pagebreak etc., then users'
lives would have been made a great deal easier; however, this did not happen, and hence we are left in a
situation where data created in one application cannot easily be read by another. We should also remember
that as our reliance upon information technology grows, time passes, applications and companies appear or go
bust, there may be data which we wish to exchange or reuse which were created when the world of computing
was a very different place. It is a very telling lesson that although we are still able to access data inscribed on
stone tablets or committed to papyri or parchment hundreds (if not thousands) of years ago, we already have
masses of computer-based data which are effectively lost to us because of technological progress, the demise
of particular markup schemes, and so on. Furthermore, because the standard supplies a default environment, the average end-
user of an SGML-based encoding system is unlikely to have to familiarise him- or herself with the intricacies
of the SGML declaration. Indeed it should be enough simply to be aware of the existence of the SGML
declaration, and how it might affect one's ability to create, access, or exploit a particular source of data.
The next major part of an SGML document is the prolog, which must conform to the specification
set out in the formal SGML standard, and the syntax given in the SGML declaration. Although it is hard to
discuss the prolog without getting bogged down in the details of SGML, suffice it to say that it contains (at
least one) document type declaration, which in turn contains (or references) a Document Type Definition (or
DTD). The DTD is one of the single most important features of SGML, and what sets it apart from (not to say above) other descriptive markup schemes. Although we shall learn a little more about the process in the
following section, the DTD contains a series of declarations which define the particular markup language which
will be used in the document instance, and also specifies how the different parts of that language can
interrelate (e.g. which markup tags are required and optional, the contexts in which they can be used, and so
on). Often, when people talk about 'using SGML', they are actually talking about using a particular DTD, which
is why some of the negative comments that have been made about SGML (e.g. 'It's too difficult.', or 'It
doesn't allow me to encode those features which I consider to be important') are erroneous, because such
complaints should properly be directed at the DTD (and thus aimed at the DTD designer) rather than at SGML
in general. Other than some of the system constraints imposed by the SGML declaration, there are no
strictures imposed by the SGML standard regarding how simple or complex the markup language defined in the
DTD should be.
Whilst the syntax used to write a DTD is fairly straightforward, and most people find that they
can start to read and write DTDs with surprising ease, to create a good DTD requires experience and
familiarity with the needs and concerns of both data creators and end-users. A good DTD nearly always
reflects a designer's understanding of all these aspects, an appreciation of the constraints imposed by the
SGML standard, and a thorough process of document analysis (see Chapter 2) and DTD-testing. In many ways
this situation is indicative of the fact that the creators of the SGML standard did not envisage that individual
users would be very likely to produce their own DTDs for highly specific purposes. Rather, they thought (or
perhaps hoped), that groups would form within industry sectors or large-scale enterprises to produce DTDs
that were tailored to the needs of their particular application. Indeed, the areas in which the uptake of SGML
has been most enthusiastic have been operating under exactly those sorts of conditions: for example, the
international Air Transport Authority seeking to standardise aircraft maintenance documentation, or the
pharmaceutical industry's attempts to streamline the documentary evidence needed to support applications to
the US Food and Drug Administration. As we shall see, the DTD of prime importance to those working within
the Arts and Humanities disciplines has already been written and documented by the members of the Text
Encoding Initiative, and in that case the designers had the foresight to build in mechanisms to allow users to
adapt or extend the DTD to suit their specific purposes. However, as a general rule, if users wish to write
their own DTDs, or tweak an SGML declaration, they are entirely free to do so (within the framework set out
by the SGML standard), but the vast majority of SGML users prefer to rely upon an SGML declaration and
DTD created by others, for all the benefits of interoperability and reusability promised by this approach.
This brings us to the third main part of an SGML document: namely, the document instance itself.
This is the part of the document which contains a combination of raw data and markup, and its contents are
constrained by both the SGML declaration, and the contents of the prolog (especially the declarations in the
DTD). Clearly from the perspective of data creators and end-users, this is the most interesting part of an
SGML document, and it is common practice for people to use the term 'SGML document' when they are
actually referring to a document instance. Such confusion should be largely unproblematic, provided these
users always remember that when they are interchanging data (i.e. a document instance) with colleagues, they
should also pass on the relevant DTD and SGML declaration. In the next section we shall investigate the
practical steps involved in the creation of an SGML document, and the very valuable role that can be played by
SGML-aware software.
5.1.3: Creating Valid SGML Documents
How you create SGML documents will be greatly influenced by the aims of your project, the
materials you are working with, and the resources available to you. For the purposes of this discussion, let us
start by assuming that you have a collection of existing non-electronic materials which you wish to turn into
some sort of electronic edition.
If you have worked your way through the chapter on document analysis (Chapter 2), then you will
know what features of the source material are important to you, and what you will want to be able to encode
with your markup. Similarly, if you have considered the options discussed in the chapter on digitization
(Chapter 3), you will have some idea of the type of electronic files with which you will be starting to work.
Essentially, if you have chosen to OCR the material yourself, you will be using clear or plain ASCII text files,
which will need to undergo some sort of editing or translation as part of the markup process. Alternatively, if
the material has been re-keyed, then you will either have electronic text files which already contain some
basic markup, or you will also have plain ASCII text files.
Having identified the features you wish to encode, you will need to find a DTD which meets your
requirements. Rather than trying to write your own DTD from scratch, it is usually worthwhile investing some
time to look around for existing public DTDs which you might be able to adopt, extend, or adapt to suit your
particular purposes. There are many DTDs available in the public domain, or made freely available for others to
use (e.g. see Robin Cover's The SGML/XML Web Page (http://www.oasis-open.org/cover/)), but even if none of
these match your needs, some may be worth investigating to see how others have tackled common problems.
Although there are some tools available which are designed to facilitate the process of DTD-authoring, they
are probably only worth buying if you intend to be doing a great deal of work with DTDs, and they can never
compensate for poor document analysis. However, if you are working with literary or linguistic materials, you
should take the time to familiarise yourself with the work of the Text Encoding Initiative (see 5.2: The Text
Encoding Initiative and TEI Guidelines), and think very carefully before rejecting use of their DTD.
Before we go any further, let us consider two other scenarios: one where you already have the
material in electronic form but you need to convert it to SGML; the other, where you will need to create SGML
from scratch. Once again, there are many useful tools available to help convert from one markup scheme to another, but if your target format is SGML, the nature of the source markup may have some bearing on the likelihood of success (or failure)
of any conversion process. As we have seen, SGML lends itself most naturally to a structured, hierarchical view
of a document's content (although it is perfectly possible to represent very loose organisational structures,
and even non-hierarchical document webs, using SGML markup) and this means that it is much simpler to
convert from a proprietary markup scheme to SGML if that scheme also has a strong sense of structure (i.e.
adopts a descriptive markup approach) and has been used sensibly. However, if a document has been encoded
with a presentational markup scheme which has, for example, used codes to indicate that certain words should
be rendered in an italic font (regardless of the fact that sometimes this has been for emphasis, at other times to indicate book and journal titles, and elsewhere to indicate non-English words), then this will
dramatically reduce the chances of automatically converting the data from this presentation-oriented markup
scheme into one which complies with an SGML DTD.
It is probably worth noting at this point that these conversion problems primarily apply when
converting from a non-descriptive, non-SGML markup language into SGML; the opposite process, namely
converting from SGML into another target markup scheme, is much more straightforward (because it would
simply mean that data variously marked-up with, say, <EMPHASIS>, <TITLE>, and <FOREIGN> tags, had their
markup converted into the target scheme's markup tags for <ITALIC>). It is also worth noting that such a
conversion might not be a particularly good idea, because you would effectively be throwing information away.
In practice it would be much more sensible to retain the descriptive/SGML version of your material, and
convert to a presentational markup scheme only when absolutely required for the successful rendering of your
data on screen or on paper. Indeed, many dedicated SGML applications support the use of stylesheets to offer
some control over the on-screen rendition of SGML-encoded material, whilst preserving the SGML markup
behind the scenes.
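To illustrate the point, consider the following hypothetical fragment (the element names are those used informally above, rather than those of any published DTD), encoded first descriptively and then presentationally. Converting from the first form to the second is a trivial, mechanical task, whereas recovering the first form from the second requires a human reader to decide why each phrase was italicised:
Descriptive markup:
  <P>She had read <TITLE>Middlemarch</TITLE> twice, and was
  <EMPHASIS>determined</EMPHASIS> to re-read it <FOREIGN>tout de suite</FOREIGN>.</P>
Presentational markup:
  <P>She had read <ITALIC>Middlemarch</ITALIC> twice, and was
  <ITALIC>determined</ITALIC> to re-read it <ITALIC>tout de suite</ITALIC>.</P>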
If you are creating SGML documents from scratch, or editing existing SGML documents (perhaps
the products of a conversion process, or the results of a re-keying exercise), there are several factors to
consider. It is essential that you have access to a validating SGML parser, which is a software program that
can read an SGML declaration and a document's prolog, understand the declarations in the DTD, and ensure
that the SGML markup used throughout the document instance conforms appropriately. In many commercial
SGML- and XML-aware software packages, a validating parser is included as standard and is often very closely
integrated with the relevant tools (e.g. to ensure that any simple editing operations, such as cut and paste, do
not result in the document failing to conform to the rules set out in the DTD because markup has been
inserted or removed inappropriately). It is also possible to find freeware and public domain software which have
some understanding of the markup rules expressed in the DTD, while also allowing users to validate their
documents with a separate parser in order to guarantee conformance. Your choice will probably be dictated by
the kind of software you currently use (e.g. in the case of editors: Windows-based office-type applications, or Unix-style plain text editors?), the budget you have available, and the files with which you will be working.
Whatever your decision, it is important to remember that a parser can only validate markup against the
declarations in a DTD, and it cannot pick up semantic errors (e.g. incorrectly tagging a person's name as, say, a
place name, or an epigraph as if it were a subtitle).
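For example, a validating parser would happily accept the following fragment (which uses hypothetical name elements purely for the sake of illustration) even though the two tags have plainly been swapped, because the error lies in the interpretation of the content rather than in any breach of the rules declared in the DTD:
  <PLACENAME>William Caxton</PLACENAME> set up his printing press at
  <PERSNAME>Westminster</PERSNAME> in 1476.
Only a human reader, or a separate proof-reading exercise, will catch mistakes of this kind.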
So for the purposes of creating valid SGML documents, we have seen that there are a number of
tools which you may wish to consider. If you already have files in electronic form, you will need to investigate
translation or auto-tagging software; and if you have a great many files of the same type, you will probably
want software which supports batch processing, rather than anything which requires you to work on one file at
a time. If you are creating SGML documents from scratch, or cleaning-up the output of a conversion process,
you will need some sort of editor (ideally one that is SGML-aware), and if your editor does not incorporate a
parser, you will need to obtain one that can be run as a stand-alone application (there are one or two
exceptionally good parsers freely available in the public domain). For an idea of the range of SGML and XML
tools available, readers should consult Steve Pepper's The Whirlwind Guide to SGML & XML Tools and Vendors
(http://www.infotek.no/sgmltool/guide.htm).
Producing valid SGML files which conform to a DTD is, in some respects, only the first stage in
any project. If you want to search the files for particular words, phrases, or marked-up features, you may
prefer to use an SGML-aware search engine, but some people are perfectly happy writing short scripts in a
language like Perl. If you want to conduct sophisticated computer-assisted text analysis of your material, you
will almost certainly need to look at adapting an existing tool, or writing your own code. Having obtained your
SGML text, whether as complete documents or as fragments resulting from a search, you will need to find
some way of displaying it. You might choose to simply convert the SGML markup in the data into another
format (e.g. HTML for display in a conventional web browser), or you might use one of the specialist SGML
viewing packages to publish the results, which is how many commercial SGML-based electronic texts are
produced. We do not have sufficient space to consider all the various alternatives in this publication, but once
again you can get an idea of the options available by looking at The Whirlwind Guide to SGML & XML Tools
and Vendors (http://www.infotek.no/sgmltool/guide.htm) or, more generally, The SGML/XML Web Page
(http://www.oasis-open.org/cover/).
5.1.4: XML: The Future for SGML
As we saw in the previous section, an SGML-based markup language usually offers a number of
advantages over other types of markup scheme, especially those which rely upon proprietary encoding.
However, although SGML has met with considerable success in certain areas of publishing and many
commercial, industrial, and governmental sectors, its uptake by the academic community has been relatively
limited (with the notable exception of the Text Encoding Initiative, see 5.2: The Text Encoding Initiative and
TEI Guidelines, below). We can speculate on why this might be so: for example, SGML has an undeserved reputation for being difficult and expensive to produce, both because it supposedly imposes prohibitive intellectual overheads and because the necessary software is lacking (leastways at prices academics can afford). While it is true that performing a thorough document analysis and developing a suitable DTD should not be undertaken lightly, it
could be argued that to approach the production of any electronic text without first investing such intellectual
resources is likely to lead to difficulties (either in the usefulness or the long-term viability of the resulting
resource). The apparent lack of readily available, easy-to-use SGML software is perhaps a more valid criticism, yet the resources have been available for those willing to look, and then to invest the time necessary to learn a new package (although freely available software tends to put more of an onus on the user than some of the commercial products). However, what is undoubtedly true is that writing a piece of SGML software (e.g. a validating SGML parser) which fully implements the SGML standard is an extremely demanding task, and this has been reflected in the price and sophistication of some commercial applications.
Whilst SGML is probably more ubiquitous than many people realise, HTML, the markup language of the World Wide Web, is much better known. Nowadays, the notion of 'the Web' is effectively
synonymous with the global Internet, and HTML plays a fundamental role in the delivery and presentation of
information over the Web. The main advantage of HTML is that it is a fixed set of markup tags designed to
support the creation of straightforward hypertext documents. It is easy to learn and easy for developers to
implement in their software (e.g. HTML editors and browsers), and the combination of these factors has
played a large part in the rapid growth and widespread acceptance of the Web. There is so much information
about HTML already available that there is little to be gained from going into much detail here; however,
readers who wish to know more should visit the W3C's HyperText Markup Language Home Page
(http://www.w3.org/MarkUp/).
Although HTML was not originally designed as an application of SGML, it soon became one once
the designers realised the benefits to be gained from having a DTD (e.g. a validating parser could be used to
ensure that markup had been used correctly, and so the resulting files would be easier for browsers to
process). However, this meant that the HTML DTD had to be written retrospectively, and in such a way that
any existing HTML documents would still conform to the DTD, which in turn meant that the value of the DTD
was effectively diminished! This situation led to the release of a succession of different versions of HTML,
each with their own slightly different DTD. Nowadays, the most widely accepted release of HTML is probably
version 3.2, although the World Wide Web Consortium (W3C) released HTML 4.0 on 18th December 1997 in
order to address a number of outstanding concerns about the HTML standard. Further versions of HTML seem unlikely, although there is work going on within the HTML committees of the W3C to take into account
other developments within the W3C, and this has led to proposals such as the XHTML 1.0 Proposed
Recommendation document released on 24th August 1999 (see http://www.w3.org/TR/1999/PR-xhtml1-
19990824/).
It is perfectly possible to deliver SGML documents over the Web, but there are several ways
that this can be achieved and each has different implications. In order to retain the full 'added-value' of the
SGML markup, you might choose to deliver the raw SGML data over the Web and rely upon a behind-the-
scenes negotiation between your web-server and the client's browser to ensure that an appropriate SGML-
viewing tool is launched on the client's machine. This enables the end-user to exploit fully the SGML markup
included in your document, provided that s/he has been able to obtain and install the appropriate software.
Another possibility would be to offer a Web-to-SGML interface on your server, so that end-users can access
your documents using an ordinary Web browser whilst all the processing of the SGML markup takes place on
the server, and the results are delivered as HTML. Alternatively, you might decide to simply convert the
markup into HTML from whatever SGML DTD has been used to encode the document (either on-the-fly, or as
part of a batch process) so that the end-user can use an ordinary Web browser and the server will not have to
undertake any additional processing. The last of these options, while placing the least demands on the end-
user, effectively involves throwing away all the extra intellectual information that is represented by the SGML
encoding; for example, if in your original SGML document, proper nouns, place names, foreign words, and
certain types of emphasis have each been encoded with different markup according to your SGML DTD, they
may all be translated to <EM> tags in HTML, and thus any automatically identifiable distinction between
these different types of content will probably have been lost. The first option retains the advantages of using
SGML, whilst placing a significant onus on the end-user to configure his or her Web browser correctly to launch
supporting applications. The second option represents a middle way: exploiting the SGML markup whilst
delivering easy-to-use HTML, but with the disadvantage of having to do much more sophisticated processing at
the Web server.
Until recently, therefore, those who create and deliver electronic text were confronted with a
dilemma: to use their own SGML DTD with all the additional processing overheads that entails, or use an HTML
DTD and suffer a diminution of intellectual rigour and descriptive power? Extending HTML was not an option
for individuals and projects, because the developers of Web tools were only interested in supporting the
flavours of HTML endorsed by the W3C. Meanwhile, delivering electronic text marked-up according to another
SGML DTD meant that end-users were obliged to obtain suitable SGML-aware tools, and very few of them
seemed willing to do this. One possible solution to this dilemma is the Extensible Markup Language (XML) 1.0
(see http://www.w3.org/TR/REC-xml), which became a W3C Recommendation (the nearest thing to a formal
standard) on 10th February 1998.
The creators of XML adopted the following design goals:
1. XML shall be straightforwardly usable over the Internet.
2. XML shall support a wide variety of applications.
3. XML shall be compatible with SGML.
4. It shall be easy to write programs which process XML documents.
5. The number of optional features in XML is to be kept to the absolute minimum, ideally zero.
6. XML documents should be human-legible and reasonably clear.
7. The XML design should be prepared quickly.
8. The design of XML shall be formal and concise.
9. XML documents shall be easy to create.
10. Terseness in XML markup is of minimal importance.
They sought to gain the generic advantages offered by supporting arbitrary SGML DTDs, whilst
retaining much of the operational simplicity of using HTML. To this end, they 'threw away' all the optional
features of the SGML standard which make it difficult (and therefore expensive) to process. At the same
time they retained the ability for users to write their own DTDs, so that they can develop markup schemes
which are tailored to suit particular applications but which are still enforceable by a validating parser. Perhaps
most importantly of all, the committee which designed XML had representatives from several major companies
which develop software applications for use with the Web, particularly browsers, and this has helped to
encourage a great deal of interest in XML's potential.
SGML has its roots in a time when creating, storing, and processing information on computer was
expensive and time-consuming. Many of the optional features supported by the SGML standard were intended
to make it cheaper to create and store SGML-conformant documents in an era when it was envisaged that all
the markup would be laboriously inserted by hand, and megabytes of disk space were extremely expensive.
Nowadays, faster and cheaper processors, and the falling costs of storage media (both magnetic and optical),
mean that the designers and users of applications are less worried about the concerns of SGML's original
designers. On the other hand, the ever growing volume of electronic information makes it all the more
important that any markup which has been used has been applied in a thoroughly consistent and easy-to-process
manner, thereby helping to ensure that today's applications perform satisfactorily.
XML addresses these familiar concerns, whilst taking advantage of modern computer systems
and the lessons learned from using SGML. For example, now that the cost of storing data is of less concern to
most users (except for those dealing with extremely large quantities of data), there is no need to offer
support for markup short-cuts which, while saving storage space, tend to impose an additional load when
information is processed. Instead, XML's designers were able to build-in the concept of 'well-formed' data,
which requires that any marked-up data are explicitly bounded by start- and end-tags, and that all the tagged
data in a document nest appropriately (so that it becomes possible, say, to generate a document tree which
captures the hierarchical arrangement of all the data elements in the document). This has the added
advantage that when two applications (such as a database and a web server) need to exchange data, they can
use well-formed XML as their interchange format, because both the sending and receiving application can be
certain that any data they receive will be appropriately marked-up and there can be no possible ambiguity
about where particular data structures start and end.
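A brief sketch may make the distinction clearer. In the first fragment below (the element names are invented for the purposes of the example) every start-tag has a matching end-tag and the elements nest properly, so the fragment is well-formed; in the second, the <title> and <date> elements overlap rather than nest, so it is not:
Well-formed:
  <entry><title>Beowulf</title><date>c. 1000</date></entry>
Not well-formed (the tags overlap):
  <entry><title>Beowulf<date></title>c. 1000</date></entry>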
XML takes this approach one stage further by adopting the SGML concept of DTDs, such that an
XML document is said to be 'valid' if it has an associated DTD and the markup used in the document has been
checked (by a validating parser) against the declarations expressed in that DTD. If an application knows that
it will be handling valid XML, and has an understanding of and access to the relevant DTD, this can greatly
improve its ability to process that data for example, a search and retrieval application would be able to
construct a list of all the marked-up data structures in the document, so that a user could refine the search
criteria accordingly. Knowing that a vast collection of XML documents have all been validated against a
particular DTD will greatly assist the processing of that collection, as valid XML data is also necessarily well-
formed. By contrast, while it is possible for an XML application to process a well-formed document such that it
can derive one possible DTD which could represent the data structures it contains, that DTD may not be
sufficient to represent all the well-formed XML documents of the same type. There are clearly many
advantages to be gained from creating and using valid XML data, but the option remains to use well-formed
XML data in those situations where it would be appropriate.
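As a simple sketch of the difference, the following invented example declares a tiny DTD in the document's prolog and then uses it; a validating parser could confirm that this document is valid, whereas the same fragment without the document type declaration could only ever claim to be well-formed:
  <!DOCTYPE entry [
    <!ELEMENT entry (title, date)>
    <!ELEMENT title (#PCDATA)>
    <!ELEMENT date  (#PCDATA)>
  ]>
  <entry>
    <title>Beowulf</title>
    <date>c. 1000</date>
  </entry>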
Today's Web browsers expect to receive conformant HTML data, and any additional markup
included in the data which is not recognised by the browser is usually ignored. The next generation of Web
browsers will know how to handle XML data, and while all of them will know how to process HTML data by
default, they will also be prepared to cope with any well-formed or valid XML data that they receive. This
offers the opportunity for groups of users to come together, agree upon a DTD they wish to adopt, and then
create and exchange valid XML data which conforms to that DTD. Thus, a group of academics concerned with
the creation of electronic scholarly editions of major texts could all agree to prepare their data in accordance
with a particular DTD which enabled them to markup the features of the texts which they felt to be
appropriate for their work. They could then exchange the results of their labours safe in the knowledge that
they could all be correctly processed by their favourite software (whether browsers, editors, text analysis
tools, or whatever).
Readers who wish to explore the similarities and differences between SGML and XML are
advised to consult the sources mentioned on Robin Cover's The SGML/XML Web Page (http://www.oasis-
open.org/cover/). Projects which have invested heavily in the creation of SGML-conformant resources are
well-placed to take advantage of XML developments, because any conversions that are required should be
straightforward to implement. However, it is important to bear in mind that at the moment XML is just one of
a suite of emerging standards, and it may be a little while yet before the situation becomes completely clear.
For example the Extensible Stylesheet Language (XSL) Specification (http://www.w3.org/TR/WD-xsl/) for
expressing stylesheets as XML documents is still under development, as are the proposals to develop XML
Schema (http://www.w3.org/TR/xmlschema-1/ ), which may ultimately replace the role of DTDs when creating
XML documents (and provide support not just for declaring data structures, but also for strong data typing
such that it would be possible to ensure, say, that the contents of a <DATE> element conformed to a particular
international standard date format).
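By way of illustration only (the schema syntax was still subject to change at the time of writing, so the following is a sketch of the kind of declaration envisaged rather than the syntax of any particular draft), a schema might declare that the content of a <DATE> element must be a calendar date, which is something a DTD cannot express:
  <schema>
    <!-- illustrative only: the content of <DATE> must conform to a built-in date datatype -->
    <element name="DATE" type="date"/>
  </schema>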
5.2: The Text Encoding Initiative and TEI Guidelines
5.2.1: A brief history of the TEI
(Much of the following text is extracted from publicly available TEI documents, and is
reproduced here with minor amendments and the permission of the TEI Editors.)
The TEI began with a planning conference convened by the Association for Computers and the
Humanities (ACH), gathering together over thirty experts in the field of electronic texts, representing
professional societies, research centers, and text and data archives. The planning conference was funded by
the U.S. National Endowment for the Humanities (NEH, an independent federal agency) and took place
at Vassar College, Poughkeepsie, New York on 12-13 November 1987.
Those attending the conference agreed that there was a pressing need for a common text
encoding scheme that researchers could use when creating electronic texts, to replace the existing system in
which every text provider and every software developer had to invent and support their own scheme (since
existing schemes were typically ad hoc constructs with support for the particular interests of their creators,
but not built for general use). At a similar conference ten years earlier, one participant pointed out, everyone
had agreed that a common encoding scheme was desirable, and predicted chaos if one was not developed. At
the Poughkeepsie meeting, no one predicted chaos: everyone agreed that chaos had already arrived.
After two days of intense discussion, the participants in the meeting reached agreement on the
desirability and feasibility of creating a common encoding scheme for use both in creating new documents and
in exchanging existing documents among text and data archives; the closing statement, the Poughkeepsie Principles (see http://www-tei.uic.edu/orgs/tei/info/pcp1.html), enunciated precepts to guide the creation of
such a scheme.
After the planning conference, the task of developing an encoding scheme for use in creating
electronic texts for research was undertaken by three sponsoring organisations: the Association for
Computers and the Humanities (ACH), the Association for Computational Linguistics (ACL), and the Association
for Literary and Linguistic Computing (ALLC). Each sponsoring organisation named representatives to a
Steering Committee, which was responsible for the overall direction of the project. Furthermore, a number of
other interested professional societies were involved in the project as participating organisations, and each of
these named a representative to the TEI Advisory Board.
With support from NEH and later from the Commission of the European Communities and the
Andrew W. Mellon Foundation, the TEI began the task of developing a draft set of Guidelines for Electronic
Text Encoding and Interchange. Working committees, comprising scholars from all over North America and
Europe, drafted recommendations on various aspects of the problem, which were integrated into a first public
draft (document TEI P1), which was published for public comment in June 1990.
After the publication of the first draft, work began immediately on its revision. Fifteen or so
specialised work groups were assigned to refine the contents of TEI P1 and to extend it to areas not yet
covered. So much work was produced that a bottleneck ensued in getting it ready for publication, and the second
draft of the Guidelines (TEI P2) was released chapter by chapter from April 1992 through November 1993.
During 1993, all published chapters were revised yet again, some other necessary materials were added, and
the development phase of the TEI came to its conclusion with the publication of the first 'official' version of
the Guidelines (the first not to be labelled a draft) in May 1994 (Sperberg-McQueen and Burnard 1994).
Since that time, the TEI has concentrated on making the Guidelines (TEI P3) more accessible to users,
teaching workshops and training users, and on preparing ancillary material such as tutorials and introductions.
5.2.2: The TEI Guidelines and TEI Lite
The goals outlined in the Poughkeepsie Principles (see http://www-
tei.uic.edu/orgs/tei/info/pcp1.html) were elaborated and interpreted in a series of design documents, which
recommended that the Guidelines should:
- suffice to represent the textual features needed for research
- be simple, clear, and concrete
- be easy for researchers to use without special purpose software
- allow the rigorous definition and efficient processing of texts
- provide for user-defined extensions
- conform to existing and emergent standards
As the product of many leading members of the research community, the Guidelines not surprisingly take research needs as their prime focus. The TEI established a plethora of work groups covering everything from 'Character Sets' and 'Manuscripts and Codicology' to 'Historical Studies' and 'Machine Lexica', in order to ensure that the interests of the various sectors of the arts and
humanities research community were adequately represented. As one of the co-editors of the Guidelines,
Michael Sperberg-McQueen wrote, 'Research work requires above all the ability to define rigorously (i.e.
precisely, unambiguously, and completely) both the textual objects being encoded and the operations to be
performed upon them. Only a rigorous scheme can achieve the generality required for research, while at the
same time making possible extensive automation of many text-management tasks.' (Sperberg-McQueen and
Burnard 1995, 18). As we saw in the previous section (5.1 The Standard Generalized Markup Language), SGML
offers all the necessary techniques to define and enforce a formal grammar, and so it was chosen as the basis
for the TEI's encoding scheme.
The designers of the TEI also had to decide how to reconcile the need to represent the textual
features required by researchers, with their other expressed intention of keeping the design simple, clear,
and concrete. They concluded that rather than have many different SGML DTDs (i.e. one for each area of
research), they would develop a single DTD with sufficient flexibility to meet a range of scholars' needs. They
began by resolving that wherever possible, the number of markup elements should not proliferate
unnecessarily (e.g. have a single <NOTE> tag with a TYPE attribute to say whether it was a footnote, endnote,
shouldernote etc., rather than having separate <FOOTNOTE>, <ENDNOTE>, <SHOULDERNOTE> tags). Yet as
this would still result in a large and complex DTD, they also decided to implement a modular design, grouping sets of markup tags according to particular descriptive functions, so that scholars could choose to mix and match as many or as few of these markup tags as they required. Lastly, in order to meet the needs of those
scholars whose markup requirements could not be met by this comprehensive DTD, they designed it in such a
way that the DTD could be adapted or extended in a standard fashion, thereby allowing these scholars to
operate within the TEI framework and retain the right to claim compliance with the TEI's Guidelines.
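For example, the single <NOTE> element mentioned above is used along the following lines (a sketch based on TEI practice, with invented content):
  <note type="footnote" n="1">The 1623 Folio reads 'solid' here.</note>
  <note type="endnote" n="27">See the textual apparatus for the principal variants.</note>
The same element therefore serves for many kinds of annotation, with the TYPE attribute carrying the distinction.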
There is no doubt that the TEI's DTD and Guidelines can appear rather daunting at first,
especially if one is unfamiliar with descriptive markup, text encoding issues, or SGML/XML applications.
However, for anyone seriously concerned about creating an electronic textual resource which will remain viable
and usable in the 'long-term' (which can be less than a decade in the rapidly changing world of information
technology), the TEI's approach certainly merits very serious investigation, and you should think very carefully
before deciding to reject the TEI's methods in favour of another apparent solution.
The importance of the modularity and extensibility of the TEI's DTD cannot be over-stated. In
order to make their design philosophy more accessible to new users of text encoding and SGML/XML, the
creators of the TEI's DTD have developed what they describe as the 'Chicago pizza model' of DTD
construction. Every Chicago (indeed, U.S.) pizza must have certain ingredients in common, namely cheese and tomato sauce; pizza bases can be selected from a pre-determined limited range of types (e.g. thin-crust, deep-
dish, or stuffed), whilst pizza toppings may vary considerably (from a range of well-known ingredients, through
to local specialities or idiosyncratic preferences!). In the same way, every implementation of the TEI DTD must have certain standard components (e.g. header information and the core tag set) and one of the eight base tag
sets (see below), to which can then be added any combination of the additional tag sets or user-defined
application-specific extensions. TEI headers are discussed in more detail in Chapter 6, whilst the core tag set
consists of common elements which are not specific to particular types of text or research application (e.g. the
<P> tag used to identify paragraphs). Of the eight base tag sets, six are designed for use with texts of one
particular type (i.e. prose, verse, drama, transcriptions of spoken material, printed dictionaries, and
terminological data), whilst the other two (general, and mixed) allow for anthologies or unrestricted mixing of
the other base types. The additional tag sets (the pizza toppings) provide the necessary markup tags for
describing such things as hypertext linking, the transcription of primary sources (especially manuscripts),
critical apparatus, names and dates, language corpora, and so on. Readers who wish to know more should consult
the full version of the Guidelines, which are also available online at http://www.hcu.ox.ac.uk/TEI/P4beta/.
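In practical terms, this mixing and matching is achieved by setting parameter entities in the document type declaration. The following sketch is based on the practice described in the Guidelines (the exact public identifier should be checked against the version of the DTD you are using); it selects the prose base tag set and adds the additional tag set for names and dates:
  <!DOCTYPE TEI.2 PUBLIC "-//TEI P3//DTD Main Document Type//EN" [
    <!ENTITY % TEI.prose 'INCLUDE'>
    <!ENTITY % TEI.names.dates 'INCLUDE'>
  ]>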
Even the brief description given above is probably enough to indicate that while the TEI scheme
offers immense descriptive possibilities, its application is not something to be undertaken lightly. With this in
mind, the designers of the TEI DTD developed a couple of pre-built versions of the DTD, of which the best
known and most widely used is called 'TEI Lite'. Each aspect of the TEI Lite DTD is documented in TEI Lite:
An Introduction to Text Encoding for Interchange (Burnard and Sperberg-McQueen 1995), which is also
available online at http://www.hcu.ox.ac.uk/TEI/Lite/. The abstract of this document states that TEI Lite
'can be used to encode a wide variety of commonly encountered textual features, in such a way as to maximize
the usability of electronic transcriptions and to facilitate their interchange among scholars using different
types of computer systems'. Indeed, many people find that the TEI Lite DTD is more than adequate for their
purposes, but even for those who do need to use the other tag sets available in the full TEI DTD, TEI Lite
provides a valuable introduction to the TEI's encoding scheme. Several people involved in the development and
maintenance of the TEI DTD have continued to investigate ways to facilitate its use, such as the 'Pizza Chef'
(available at http://www.hcu.ox.ac.uk/TEI/newpizza.html), which offers a web-based method of combining the various tag sets to make your own TEI DTD, and an XML version of TEI Lite (see The TEI Consortium
Homepage (http://www.tei-c.org/)). It can only be hoped that as more people appreciate the merits of
adopting the TEI scheme, the number of freely available SGML/XML TEI-aware tools and applications will
continue to grow.
5.3: Where to find out more about SGML/XML and the TEI
Although SGML was released as an ISO standard in 1986, its usage has grown steadily rather
than explosively, and uptake has tended to occur within the documentation departments of major corporations,
government departments, and global industries. This is in dramatic contrast to XML, which was released as a
W3C Recommendation in 1998 but was able to build on the tremendous level of international awareness about
the web and HTML (and, to some extent, on the success of SGML in numerous corporate sectors). As a very
simple indicator, on the 20th August 1999 the online catalogue of amazon.co.uk (http://amazon.co.uk) listed
only 28 books with 'SGML' in the title, as compared to 68 which mention 'XML' (and 5 of these are common to
both!).
One of the best places to find out more about both the SGML and XML standards, their
application, relevant websites, discussion lists and newsgroups, journal articles, conferences and the like, is
Robin Cover's excellent The SGML/XML Web Page (http://www.oasis-open.org/cover/). It would be pointless
to reproduce a selection of Cover's many references here (as they would rapidly go out of date), but readers
are strongly urged to visit this website and use it to identify the most relevant information sources. However,
it is also important to remember that XML (like SGML) is only one amongst a family of related standards, and
that these XML-related standards are developing and changing very rapidly so you should remember to visit
these sites regularly, or risk making the wrong decisions on the basis of out-dated information.
Keeping up-to-date with the Text Encoding Initiative is a much more straightforward matter.
The website of the TEI Consortium (http://www.tei-c.org/) provides the best starting point to accessing other
TEI-related online resources, whilst the TEI-L@LISTSERV.UIC.EDU discussion list is an active forum for
anyone interested in using the TEI's Guidelines and provides an extremely valuable source of advice and
support.
Chapter 6 : Documentation and Metadata
6.1 What is Metadata and why is it important?
Simply put, metadata is one piece of data which describes another piece of data. In the context
of digital resources the kind of information you would expect to find in a typical metadata record would be
data on the nature of a resource, who created the resource, what format it is held in, where it is held, and so
on. In recent years the issue of metadata has become a serious topic for those concerned with the creation
and management of digital resources. When digital resources first started to emerge much of the focus of
activity was centred on the creation process, without much thought given to how these resources would be
documented and found by others. In the academic arena announcements of the availability of resources tended
to be within an interested community, usually through subject-based discussion lists. However, as use of the
web has steadily increased, many institutions have come to depend on it as a crucial means of storing and
distributing information. The means by which this information is organised has now become a central issue if
the web is to continue to be an effective tool for the digital information age.
While there is an overwhelming consensus that a practical metadata model is required, a single
one has yet to emerge which will satisfy the needs of the net community as a whole. This section of the Guide
will look at two metadata models currently in use, the Dublin Core Element Set, and the TEI Header, but we
begin with an overview of the problem as it stands at the moment.
The concept of metadata has been around much longer than the web, and while there exist a
great number of metadata formats, it is most often associated with the work of the library community. The
web is commonly likened to an enormous library for the digital age, and while this analogy may not stand up to
any serious scrutiny, it is a useful one to make as it highlights the perceived problems associated with
metadata and digital resources and points towards possible solutions. At its inception the web was neither designed nor intended as a forum for the organised publication and retrieval of information, and therefore no system for
effectively cataloguing information held on the web was devised. Due to this lack of formal cataloguing
procedures the web has evolved into a 'chaotic repository for the collective output of the world's digital
"printing presses"' (Lynch 1997). Locating an item on a library shelf is a relatively simple task due to our
familiarity with a long-established procedure for doing so. Library metadata systems, such as MARC, follow a
strictly defined set of rules which are applied by a trained body of professionals. The web has few such
parallels.
One of the most common ways of locating items on the web is via a search engine, and it is to
these that the proper application of metadata would be most beneficial. While search engines are undeniably
powerful, they do not operate in an effective and precise enough way to make them trustworthy tools of
retrieval. It is estimated that there are in the region of three and a half million web sites containing five
hundred million unique files (OCLC Web Characterisation Project, June 1999,
http://www.oclc.org/oclc/research/projects/webstats), only one-third of which are indexed by search engines.
The web contains much that is difficult to catalogue in a straightforward manner (multimedia packages, audio and visual material, not to mention pages which are automatically generated), and all of these demand consideration in
any system which attempts to catalogue them. The method by which search engines index a web site is based
on the frequency of occurrences of words which appear in the document rather than identifying any real notion
of its content. The indiscriminate nature of such searches not only makes it difficult to find what you are looking for, but often buries any potentially useful information in a flurry of unwanted, unrelated 'hits'. The growing
commercialisation of the web has influenced the nature of search engines and made them even more unreliable
and of dwindling practical use to the academic community.
While search engines are now able to make better use of the HTML <META> tag (although the tag can be
open to abuse by index spamming), it is perhaps a case of too little too late. Initiatives such as the Dublin Core
go some way in trying to redress the balance, but these are still being refined and have numerous
shortcomings. The Dublin Core, in an attempt to maintain its simplicity, fails to achieve its hoped-for functionality, trading off much of its potential precision in a quest for general acceptance. The Dublin Core
element set is, in places, too general to describe coherently the complex relationships which exist within many
digital resources, and lacks the required rigidity, in areas such as the use of controlled vocabulary, to make it
easily interoperable. This applies particularly in regard to the original unqualified 15 elements, but the work of
bodies such as the Dublin Core Data Model working group, implementing Dublin Core in RDF/XML, is providing
potential solutions to these problems (http://www.ukoln.ac.uk/metadata/resources/dc/datamodel/WD-dc-
rdf/). While a single metadata scheme, adopted and implemented wholesale, would be the ideal, it is probable
that a proliferation of metadata schemes will emerge and be used by different communities. This makes the
current work centred on integrated services and interoperability all the more important.
6.1.1: Conclusion and current developments
A solution to the problem of how to document data on the web so that they can be located and retrieved with the minimum of effort is now essential if the web is to continue to thrive as a major
provider of our daily resources. It is generally recognised that what is required is a metadata scheme which
contains 'the librarian's classification and selection skills complemented by the computer scientist's ability to
automate the task of indexing and storing information' (Lynch 1997). Existing models do not go far enough in
providing a framework that satisfies the precise requirements of different communities and discipline groups,
and until clear guidelines become available on how metadata records should be created in a standardised way,
little progress will be made. In the foreseeable future it is unlikely that some outside agent will prepare your metadata for you, so proper investment in web cataloguing methods is essential if metadata is to be implemented successfully.
New developments and proposals are being investigated in an attempt to find solutions in the face
of these seemingly insurmountable problems. The Warwick Framework
(http://www.ukoln.ac.uk/metadata/resources/wf.html), for example, suggests the concept of a container
architecture, which can support the coexistence of several independently developed and maintained metadata
packages which may serve other functions (rights management, administrative metadata, etc.). Rather than
attempt to provide a metadata scheme for all web resources, the Warwick Framework uses the Dublin Core as
a starting point, but allows individual communities to extend this to fit their own subject-specific
requirements. This movement towards a more decentralised, modular and community-based solution, where the
'communities of expertise' themselves create the metadata they need has much to offer. In the UK, various
funded organisations such as the AHDS (http://ahds.ac.uk/), and projects like ROADS
(http://www.ilrt.bris.ac.uk/roads/) and DESIRE (http://www.desire.org/) are all involved in assisting the
development of subject-based information gateways that provide metadata-based services tailored to the
needs of particular user communities.
It is clear that there is still some way to go before the problems of metadata for describing
digital resources have been adequately resolved. Initiatives created to investigate the issues are still in their
infancy, but hopefully solutions will be found, either globally or within distinct communities, which will provide a
framework simple enough to be used by the maximum number of people with the minimum degree of
inconvenience.
6.2 The TEI Header
The work and objectives of the Text Encoding Initiative (TEI) and the guidelines it produced for
text encoding and interchange have already been discussed in Chapter 5. In this section dealing with metadata,
we will focus on how the TEI has approached the problems particular to the effective documentation of
electronic texts. This section will look at the TEI Header, and specifically, the version of the header as
provided by the TEI Lite DTD (http://www.hcu.ox.ac.uk/TEI/Lite/).
Unlike the Dublin Core element set, the TEI Header is not designed specifically for describing
and locating objects on the web, although it can be used for this purpose. The TEI Header provides a
mechanism for fully documenting all aspects of an electronic text. The TEI Header does not limit itself to
documenting the text only, but also provides a system for documenting its source, its encoding practices, and
the process of its creation. The TEI Header is therefore an essential resource of information for users of the
text, for software that has to process the metadata information, and for cataloguers in libraries, museums,
and archives. In contrast with the Dublin Core, whose inclusion in any document is voluntary, the presence of
the TEI Header is mandatory if the document is to be considered TEI conformant.
As with the full TEI Lite tag set, a number of optional elements are offered by the TEI Header (of which only one, the <fileDesc>, is mandatory) for use in a structured way. These elements can be extended by the addition of attributes. Therefore the TEI Header can range from a
very large and complex document to a simple, concise piece of metadata. The most basic valid TEI Lite header
would look something like:
<teiHeader>
  <fileDesc>
    <titleStmt>
      <title>A guide to good practice</title>
    </titleStmt>
    <publicationStmt>
      <p>Published by the AHDS, 1999</p>
    </publicationStmt>
    <sourceDesc>
      <bibl>A dual web and print publication</bibl>
    </sourceDesc>
  </fileDesc>
</teiHeader>
At its simplest a TEI Lite Header requires no more than a description of the electronic file
itself, a description which includes some kind of statement on what the text is called, how it is published, and
whether it has been derived or transcribed from another source.
A typical TEI Header would hopefully contain more detailed information relating to a document.
In general the header should be regarded as providing information analogous to that provided by the title page of a printed book, combined with the information usually found in an electronic 'readme' file.
As with the Dublin Core <META> tag, the TEI Header tag appears at the beginning of a text (although it can
be held separately from the document) between the SGML prolog (i.e. the SGML declaration and the DTD) and
the front matter of the text itself:
<!DOCTYPE tei.2 PUBLIC "-//TEI//DTD TEI Lite 1.6//EN">
<tei.2>
<teiHeader>
  [header details go here]
</teiHeader>
<text>
<front>
...
</front>
<body>
The metadata information contained within the TEI Header can also be utilised as an effective
resource for the information management of texts. In the same way that an online library catalogue allows
different search options and views of a collection, the metadata information in the TEI Header can also be
manipulated to present different access points into a collection of electronic texts. For example, rather than
maintain a separate, static catalogue or database, the holdings of the OTA as recorded in the metadata
information stored in the TEI Headers is used to assist in the identification and retrieval of resources. In
addition to being able to perform simple searches for the author or title of a work, users of the OTA
catalogue can submit complex queries on a number of available options, such as searching for resources by
language, genre, time period, and even by file format.
In addition to its ability to construct indexes and catalogues dynamically, the metadata contained
within the TEI Header can also be used to create other metadata and catalogue records. TEI Header
metadata can be extracted and mapped onto other well-established resource cataloguing standards, such as
library MARC records, or on to emerging standards such as the Dublin Core element set and the Resource
Description Framework (RDF). This is a relatively simple task since the TEI Header was closely modelled on
existing standards in library cataloguing. For example, the TEI Lite <author> tag within the <titleStmt> is analogous to the MARC 100 (author) field and also to the Dublin Core CREATOR element. There is no
need, therefore, to maintain several different metadata formats when they can simply be filtered from one
central information source.
For more details see http://www.ukoln.ac.uk/metadata/interoperability/ and http://www.hcu.ox.ac.uk/ota/public/publications/metadata/giordano.sgm.
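As a simple sketch of such a mapping (the content and the exact META naming convention shown here are illustrative; conventions for embedding Dublin Core in HTML have varied), the author and title recorded in a TEI Header might be re-expressed as Dublin Core metadata in an HTML page as follows:
TEI Header fragment:
  <titleStmt>
    <title>Middlemarch: an electronic edition</title>
    <author>George Eliot</author>
  </titleStmt>
Dublin Core equivalent embedded in HTML:
  <META NAME="DC.Title" CONTENT="Middlemarch: an electronic edition">
  <META NAME="DC.Creator" CONTENT="George Eliot">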
6.2.1: The TEI Lite Header Tag Set
Although the TEI Lite Header has only one required element (the <fileDesc>), it is recommended
that all four of the principal elements which comprise the header be used. The TEI Header provides scope to
describe practically all of the textual and non-textual aspects of an electronic text, so the recommendation
when creating a Header is to include as much information as is possible.
The following overview of the four main elements which go to make up the Header is by no means exhaustive; a more comprehensive account with examples can be found in the Gentle Introduction to SGML (see http://www.hcu.ox.ac.uk/TEI/Lite/teiu5_en.htm).
The four recommended elements which go to make a <teiHeader> are:
<fileDesc>: the file description. This element contains a full bibliographic description of an
electronic file.
<encodingDesc>: the encoding description. This element documents the relationship between an
electronic text and the source(s) from which it was derived.
<profileDesc>: the profile description. This element provides a detailed description of the non-
bibliographic aspects of a text, specifically the languages and sub-languages used, the situation in which it was
produced, the participants and their setting.
<revisionDesc>: the revision description. This element summarises the revision history of a file.
The elements within the TEI Header fall into three broad categories of content:
- Descriptions (containing the suffix Desc) can contain simple prose descriptions of the content
of the element. These can also contain specific sub-elements.
- Statements (containing the suffix Stmt) indicate that the element groups together a number
of specialised elements recording some structured information.
- Declarations (containing the suffix Decl) enclose information about specific encoding practices
applied to the electronic text.
The file description: <fileDesc>
The file description contains a full bibliographic description of the computer file itself. It should
provide enough useful information in itself to construct a meaningful bibliographic citation or library catalogue
entry. The <fileDesc> contains three mandatory and four optional elements (a short illustrative example follows the list below):
<titleStmt>: groups information relating to the title of the work and those responsible for its
intellectual content. Details of any major funding or sponsoring bodies can also be recorded here. This element
is mandatory.
<editionStmt>: groups together information relating to one edition of a text. This element may
contain information on the edition or version of the electronic work being documented.
<extent>: simply records the size of the electronic text in a recognisable format, e.g. bytes, Mb,
words, etc.
<publicationStmt>: records the publication or distribution details of the electronic text, including a statement on its availability status (e.g. freely available, restricted, forbidden, etc.). This element
is mandatory.
An <idno> is also included to provide a useful mechanism for identifying a bibliographic item by
assigning it one or more unique identifiers.
<seriesStmt>: groups together information about a series, if any, to which a publication belongs.
Again an <idno> element is supplied to help with identifying the unique individual work.
<notesStmt>: groups together any notes providing information about a text additional to that recorded in other parts of the bibliographic description. This general element can be used in a variety
of ways to record potentially significant details about the text and its features which have not already been
accommodated elsewhere in the header.
<sourceDesc>: groups together details of the source or sources from which the electronic edition
was derived. This element may contain a simple prose description of the text or more complex bibliographic
elements may be employed to provide a structured bibliographic reference for the work. This element is
mandatory.
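Putting these elements together, a slightly fuller <fileDesc> than the minimal header shown earlier might look something like the following (the content is, of course, invented for the purposes of illustration):
  <fileDesc>
    <titleStmt>
      <title>A guide to good practice: an electronic edition</title>
      <author>Alan Morrison</author>
    </titleStmt>
    <editionStmt>
      <p>Second electronic edition, revised.</p>
    </editionStmt>
    <extent>450 Kb</extent>
    <publicationStmt>
      <p>Published by the AHDS, 1999. Freely available for non-commercial use.</p>
    </publicationStmt>
    <sourceDesc>
      <bibl>Transcribed from the printed edition of the same title.</bibl>
    </sourceDesc>
  </fileDesc>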
The encoding description: <encodingDesc>
<encodingDesc>: documents the relationship between an electronic text and the source or sources
from which it was derived. The <encodingDesc> can contain a simple prose description detailing such features
as the purpose(s) for which the work was encoded, as well as any other relevant information concerning the
process by which it was assembled or collected. While there are no mandatory elements within the
<encodingDesc>, those available are useful for documenting the rationale behind how and why certain elements
have been implemented.
<projectDesc>: used to describe, in prose, the purpose for which the electronic text was encoded
(for example if a text forms a part of a larger collection, or was created with a particular audience in mind).
<samplingDecl>: useful in identifying the rationale behind the sampling procedure for a corpus.
<editorialDecl>: provides details of the editorial principles applied during the encoding of a text,
for example it can record whether the text has been normalised or how quotations in a text have been
handled.
<tagsDecl>: groups information on how the SGML tags have been used, and how often, within a
text.
<refsDecl>: commonly used to identify which SGML elements contain identifying information, and
whether this information is represented as attribute values or as content.
<classDecl>: defines which descriptive classification schemes (if any) have been used by other
parts of the header.
The profile description: <profileDesc>
<profileDesc>: details the non-
the languages used in the text, the situation in which the text was produced, and the participants involved in
the creation.
<creation>: groups information detailing the time and place of creation of a text.
<langUsage>: records the languages (including dialects, sub-languages, etc.) used in the text.
<textClass>: describes the nature or topic of the text in terms of a standard classification
scheme. Included in this element is a useful <keywords> tag which can be used to identify a particular
classification scheme used, and which keywords from this scheme were used.
The revision description: <revisionDesc>
<revisionDesc>: provides a detailed system for recording changes made to the text. This element
is of particular use in the administration of files, recording when changes were made to text and by whom. The
<revisionDesc> should be updated every time a significant alteration has been made to a text.
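A sketch of a simple revision description (with invented dates, names, and changes) might read:
  <revisionDesc>
    <change>
      <date>1999-08-20</date>
      <respStmt><name>K. Wikander</name><resp>editor</resp></respStmt>
      <item>Corrected the tagging of quotations in Chapter 3.</item>
    </change>
    <change>
      <date>1999-06-01</date>
      <respStmt><name>A. Morrison</name><resp>editor</resp></respStmt>
      <item>First complete draft of the header.</item>
    </change>
  </revisionDesc>
With the most recent change listed first, a reader (or a piece of software) can see at a glance when the file was last altered and by whom.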
6.2.2 The TEI Header: Conclusion
The above overview hopefully demonstrates the comprehensive nature of the TEI Header as a
mechanism for documenting electronic texts. The emergence of the electronic text over the past decade has
presented librarians and cataloguers with many new challenges. Existing library cataloguing procedures, while
inadequate to document all the features of electronic texts properly, were used as a secure foundation onto
which additional features directly relevant to the electronic text could be grafted. Chapter Nine of AACR2
(Anglo-American Cataloguing Rules) requires substantial updating and revision, as it assumes that all electronic
texts are published through a publishing company and cannot adequately catalogue texts which are only
published on the Internet. The TEI Header has proved to be an invaluable tool for those concerned with
documenting electronic resources; its supremacy in this field can be measured by the increasing number of
electronic text centres, libraries, and archives which have adopted its framework. The Oxford Text Archive
has found it indispensable as a means of managing its large collection of disparate electronic texts, not only as
a mechanism for creating its searchable catalogue, but as a means of creating other forms of metadata which
can communicate with other information systems.
Ironically it is the same generality and flexibility offered by the TEI Guidelines (P3) on creating
a header which have hindered the progress of one of the main goals of the TEI and the hopes of the electronic
text community as a whole, namely the interoperability and interchangeability of metadata. Unlike the Dublin
Core element set, which has a defined set of rules governing its content, the TEI Header has a set of
guidelines, which allow for widely divergent approaches to header creation. While this is not a major problem
for individual texts, or texts within a single collection, the variant ways in which the guidelines are interpreted
and put into practice make easy interoperability with other systems using TEI Headers more difficult than
first imagined. As with the Dublin Core element set, what is required is the wholesale adoption of a mutually
acceptable code of practice which header creators could implement. One final aspect of the TEI Header is a cause of irritation to those creating and managing TEI Headers and texts: the apparent dearth of affordable and user-friendly software aimed specifically at header production. While this has long been a
general criticism of SGML applications as a whole, the TEI can in no way be held to blame for this absence, as
it was not part of the TEI remit to create software. However it has contributed to the relatively slow uptake
and implementation of the TEI Header as the predominant method of providing well structured metadata to
the electronic text community as a whole. Until this situation is adequately resolved the tools on offer tend to
be freeware products designed by people within the SGML community itself, or large and very expensive
purpose-built SGML aware products aimed at the commercial market.
Further reading:
The SGML/XML Web Page (http://www.oasis-open.org/cover/sgml-xml.html)
Ebenezer's software suite for TEI
(http://www.umanitoba.ca/faculties/arts/linguistics/russell/ebenezer.htm)
TEI home page (http://www.tei-c.org/)
6.3 The Dublin Core Element Set and the Arts and Humanities Data Service
'The Dublin Core is a 15-element metadata element set intended to facilitate discovery of
electronic resources. Originally conceived for author-generated description of web resources, it has also
attracted the attention of formal resource description communities such as museums and libraries'
Dublin Core Metadata home page (http://purl.oclc.org/metadata/dublin_core/)
By the mid-1990s large-scale web users, document creators and information providers had
recognised the pressing need to introduce some kind of workable cataloguing scheme for documenting
resources on the web. The scheme needed to be accessible enough to be adopted and implemented by typical
web content creators who had little or no formal cataloguing training. The set of metadata elements also
needed to be simpler than those used in traditional library cataloguing systems while offering a greater
precision of retrieval than the relatively crude indexing methods employed by existing search engines and web
crawlers.
The Dublin Core Metadata Element Set grew out of a series of meetings and workshops
involving experts from the library world, the networking and digital library research community, and other
content specialists.
The basic objectives of the Dublin Core initiative included:
- to produce a core set of descriptive elements which would be capable of describing or
identifying the majority of resources available on the Internet. Unlike a traditional library where the main
focus is on cataloguing published textual materials, the Internet contains a vast range of material in a variety
of formats, including non-textual material such as images or video, most of which have not been 'published' in
any formal way.
- to make the scheme intelligible enough that it could be easily used by those with little or no formal
cataloguing training, while still retaining enough content to function effectively as a catalogue record.
- to encourage the adoption of the scheme on an international level by ensuring that it provided
the best format for documenting digital objects on the web.
The Dublin Core element set provides a straightforward framework for documenting features of
a work such as who created the work, what its content is and what languages it contains, where and from whom
it is available, and in what formats, and whether it derived from a printed source. At a basic level the element
set uses commonly understood terms and semantics which are intelligible to most disciplines and information
systems communities. The descriptive terms were chosen to be generic enough to be understood by a
document author, but could also be extended to provide full and precise cataloguing information. For example,
textual authors, painters, photographers, and writers of software programs can all be considered 'creators' in a
broad sense.
In any implementation of the Dublin Core element set, all elements are optional and repeatable.
Therefore if a work is the result of a collaboration between a number of contributors it is relatively easy to
record the details of each one (name, contact details etc.) as well as their specific contribution or role (author,
editor, photographer, etc.) by simply repeating the appropriate element.
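As a simple sketch (the names and roles below are invented, and the way a role is recorded will depend on the conventions adopted by a particular implementation), a collaborative work might therefore be described as:
  Creator:      Bloggs, Joanna
  Contributor:  Smith, John (editor)
  Contributor:  Jones, Mary (photographer)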
These basic details can be extended by the use of Dublin Core qualifiers. The Dublin Core
initiative originally defined three different kinds of qualifier: type (or sub-element) to broadly refine the
semantics of an element name, language to specify the language of an element value, and scheme to note the
existence of an element value taken from an externally defined scheme or standard. Guidelines for
implementing these qualifiers in HTML are also available. Work on integrating Dublin Core and the Resource
Description Framework (RDF), however, revealed that these terms could be the source of confusion. Dublin
Core qualifiers are now identified as either element qualifiers that refine the semantics of a particular
element or value qualifiers that provide contextual information about an element value. Take the Dublin Core
date element, for example. Element qualifiers would allow the broad concept of date to be subdivided into
things like 'date of creation' or 'date of last modification', etc. Value qualifiers might explain how a particular
element value should be parsed. For example, a date element with a value qualifier of 'ISO 8601' indicates
that the string '1999-01-01' should be parsed as the 1st of January 1999. Other value qualifiers might indicate
that an element value is taken from a particular controlled vocabulary or scheme, for example to indicate the
use of a subject term from an established scheme like the Library of Congress Subject Headings.
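In the HTML embedding described in the next section, one common convention of the period was to express an element qualifier as a dotted sub-element name and a value qualifier using the SCHEME attribute defined in HTML 4.0. The following sketch (with invented content) shows both:
  <META NAME="DC.Date.Created" SCHEME="ISO8601" CONTENT="1999-01-01">
  <META NAME="DC.Subject"      SCHEME="LCSH"    CONTENT="English drama">
Here 'Date.Created' refines the broad date element (an element qualifier), while the SCHEME values indicate that the content is to be interpreted against ISO 8601 and the Library of Congress Subject Headings respectively (value qualifiers).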
6.3.1 Implementing the Dublin Core
The Dublin Core element set was designed for documenting web resources and it is easily
integrated into web pages using the HTML <META> tag, inserted between the <HEAD>...</HEAD> tags and
before the <BODY> of the work. An Internet-Draft has been published that explains how this should be done
(http://www.ietf.org/internet-drafts/draft-kunze-dchtml-02.txt). No specialist tools more sophisticated than
an average word processor are required to produce the content of a Dublin Core record; however a number of
labour-saving devices are available, notably the DC-dot generator available from the UKOLN web site
(http://www.ukoln.ac.uk/metadata/dcdot/). DC-dot can automatically generate Dublin Core metadata for a web
site and encode this in HTML <META> tags and other formats. The metadata produced can also be easily
edited and extended further. The Nordic Metadata Project Template is an alternative way of creating simple
Dublin Core metadata that can be embedded in HTML <META> tags (http://www.lub.lu.se/cgi-bin/nmdc.pl).
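The result of embedding a handful of elements in a web page might look roughly like the following. The resource and all of the values are invented for illustration; tools such as DC-dot or the Nordic template will produce broadly similar, though not necessarily identical, output:
  <HEAD>
  <TITLE>King Lear: an electronic edition</TITLE>
  <META NAME="DC.Title"       CONTENT="King Lear: an electronic edition">
  <META NAME="DC.Creator"     CONTENT="Shakespeare, William">
  <META NAME="DC.Contributor" CONTENT="Bloggs, Joanna (transcriber)">
  <META NAME="DC.Date"        SCHEME="ISO8601" CONTENT="1999-11-12">
  <META NAME="DC.Format"      CONTENT="text/html">
  </HEAD>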
6.3.2 Conclusions and further reading
The Dublin Core element scheme offers enormous potential as a usable standard cataloguing
procedure for digital resources on the web. The core set of elements is broad and encompassing enough to be
of use to novice web authors and skilled cataloguers alike. However, its success will ultimately depend on its
wide-scale adoption by the Internet community as a whole. It is also crucial that the rules of the scheme be
implemented in an intelligent and systematic way. To fulfil this objective, more has to be done to refine and
stabilise the element set. The provision and use of simple Dublin Core generating tools, which demonstrate the
benefits of including metadata, need to become more prevalent.
The Arts and Humanities Data Service (AHDS), in association with the UK Office for Library and
Information Networking (UKOLN), has produced a publication which outlines in more detail the best practices
involved in using Dublin Core, as well as giving many practical examples: Discovering Online Resources across
the Humanities: A practical implementation of the Dublin Core (ISBN 0-9516856-4-3). This publication is also
freely available from the AHDS web site (http://ahds.ac.uk/).
A practical illustration of how the Dublin Core element set can be implemented in order to
perform searches for individual items across disparate collections is the AHDS Gateway
(http://ahds.ac.uk:8080/ahds_live/). The AHDS Gateway is, in reality, an integrated catalogue of the holdings
of the five individual Service Providers which make up the AHDS. Although the Service Providers are
separated geographically, each supplies Dublin Core records describing its holdings, so users can very
simply search across the complete holdings of the AHDS from a single access point.
6.3.3 The Dublin Core Elements
This set of official definitions of the Dublin Core metadata element set is based on:
http://purl.oclc.org/metadata/dublin_core_elements
Element Descriptions
1. Title
Label: TITLE
The name given to the resource by the CREATOR or PUBLISHER. Where possible, standard
authority files should be consulted when entering the content of this element; for example, the Library of
Congress or British Library title lists can be used. If such authorities are used, the source should always be
indicated, using the 'scheme' (value) qualifier.
2. Author or Creator
Label: CREATOR
The person or organisation primarily responsible for creating the intellectual content of the
resource. For example, authors in the case of written documents, artists, photographers, or illustrators in the
case of visual resources. Note that this element does not refer to the person who is responsible for digitizing
a work; this belongs in the CONTRIBUTOR element. So in the case of a machine-readable version of King Lear
held by the OTA, the CREATOR remains William Shakespeare, and not the person who transcribed it into
digital form. Again, standard authority files should be consulted for the content of this element.
3. Subject and Keywords
Label: SUBJECT
The topic of the resource. Typically, subject will be expressed as keywords or phrases that
describe the subject or content of the resource. The use of controlled vocabularies and formal classification
schemas is encouraged.
4. Description
Label: DESCRIPTION
A textual description of the content of the resource, including abstracts in the case of
document-like objects or content descriptions in the case of visual resources.
5. Publisher
Label: PUBLISHER
The entity responsible for making the resource available in its present form, such as a publishing
house, a university department, or a corporate entity.
6. Other Contributor
Label: CONTRIBUTOR
A person or organisation not specified in a CREATOR element who has made significant
intellectual contributions to the resource but whose contribution is secondary to any person or organisation
specified in a CREATOR element (for example, editor, transcriber, and illustrator).
7. Date
Label: DATE
The date the resource was made available in its present form. Recommended best practice is an 8
digit number in the form YYYY-MM-DD as defined in http://www.w3.org/TR/NOTE-datetime, a profile of ISO
8601. In this scheme, the date value 1994-11-05 corresponds to November 5, 1994. Many other schemes are
possible but, if used, they should be identified in an unambiguous manner.
8. Resource Type
Label: TYPE
The category of the resource, such as home page, novel, poem, working paper, technical report,
essay, dictionary. For the sake of interoperability, TYPE should be selected from an enumerated list that is
under development in the workshop series at the time of publication of this document. See
http://sunsite.berkeley.edu/Metadata/types.html for current thinking on the application of this element.
9. Format
Label: FORMAT
The data format of the resource, used to identify the software and possibly hardware that
might be needed to display or operate the resource. For the sake of interoperability, FORMAT should be
selected from an enumerated list that is under development in the workshop series at the time of publication
of this document.
10. Resource Identifier
Label: IDENTIFIER
Unique string or number used to identify the resource. Examples for networked resources include
URLs and URNs (when implemented). Other globally unique identifiers, such as International Standard Book
Numbers (ISBNs) or other formal names, would also be candidates for this element in the case of off-line
resources.
11. Source
Label: SOURCE
Unique string or number used to identify the work from which this resource was derived, if
applicable. For example, a PDF version of a novel might have a SOURCE element containing an ISBN number for
the physical book from which the PDF version was derived.
12. Language
Label: LANGUAGE
Language(s) of the intellectual content of the resource. Where practical, the content of this
field should coincide with RFC 1766. See: http://info.internet.isi.edu/in-notes/rfc/files/rfc1766.txt
13. Relation
Label: RELATION
The relationship of this resource to other resources. The intent of this element is to provide a
means to express relationships among resources that have formal relationships to others, but exist as discrete
resources themselves. For example, images in a document, chapters in a book, or items in a collection. Formal
specification of RELATION is currently under development. Users and developers should understand that use
of this element is currently considered to be experimental.
14. Coverage
Label: COVERAGE
The spatial and/or temporal characteristics of the resource. Formal specification of COVERAGE
is currently under development. Users and developers should understand that use of this element is currently
considered to be experimental.
15. Rights Management
Label: RIGHTS
A link to a copyright notice, to a rights-management statement, or to a service that would
provide information about terms of access to the resource. Formal specification of RIGHTS is currently under
development. Users and developers should understand that use of this element is currently considered to be
experimental.
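Drawing the fifteen elements together, a record for a hypothetical digitized novel might read as follows. Every name and URL below is invented for illustration, and the TYPE and FORMAT values are merely plausible examples rather than terms taken from an approved enumerated list:
  Title:       Wuthering Heights: a machine-readable transcription
  Creator:     Bronte, Emily
  Contributor: Bloggs, Joanna (transcriber and encoder)
  Publisher:   An Electronic Text Centre
  Date:        1999-11-12 (scheme: ISO 8601)
  Type:        Text.Novel
  Format:      text/sgml
  Identifier:  http://www.example.org/texts/wuthering.sgml
  Source:      ISBN of the printed edition from which the transcription was made
  Language:    en (scheme: RFC 1766)
  Rights:      http://www.example.org/texts/conditions.html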
Chapter 7: Summary
This final chapter is not intended to duplicate material contained elsewhere in this Guide.
Instead, it outlines the ten major steps which make up an ideal electronic text creation project. Of course
readers should bear in mind that, as we live in a far from ideal world, it is usually necessary to revisit some
steps in the process several times over.
Step 1: Sort out the rights
There is absolutely no point in trying to proceed with any kind of electronic text creation project
if you have not obtained appropriate permissions from all those who hold any form of rights in the material
with which you are hoping to work. This can be a tedious and time-consuming process, but time spent now can
save unpleasant and potentially costly legal wrangles later on.
Many archives and libraries will be happy for you to use their material (e.g. in the case of
manuscript sources) provided that they are given appropriate attribution, and perhaps some small recompense
if you intend to create a saleable resource. If you are working from photographs, facsimiles, or microfilm, then
the creators and publishers of these items will also have rights which need to be considered. Similarly, if you
are working from printed sources, you will need to ensure that nothing you are doing will infringe any of the
rights held by the publishers and/or editors (although you may be able to negotiate the necessary permissions
if you have a clear idea how the material will be used). Even if you are working from an electronic text which
you obtained at no cost (e.g. via the web), you should still clarify the rights situation concerning your source
material.
Obtain all permissions in writing rather than relying upon verbal assurances or standard
disclaimers, and never assume that people will not bother to sue you. If in doubt, take professional legal
advice; it is also worth investigating whether or not your institution already has a dedicated Copyright
Officer or retains specialist legal staff who may be able to offer you some assistance.
Step 2: Assess your material
Refer to the chapters on Document Analysis and Digitization to establish the best way to capture
and represent your source material. At some level this will almost certainly necessitate a degree of
compromise between what you would like to do, and what you are able to do with the knowledge and resources
currently available to you. However, it is important to consider the implications of any decisions taken now, and
to ensure that as far as possible you facilitate the future reuse of your material.
Step 3: Clarify your objectives
This relates to Step 2. The better your sense of how you would like to use your electronic text
(and/or how you envisage others using it), the easier it will be to establish how you should set about creating
it. There is little point in creating lavish high-quality digital images, or richly encoded transcriptions, if all you
wish to do is construct a basic concordance or perform simple computer-assisted linguistic analyses. However,
if you are aiming to produce a flexible electronic edition of your source text one which will support many
kinds of scholarly needs or simply wish to offer users a digital surrogate for the original item, then such an
investment may be worthwhile. You may find it easier to obtain financial support for your efforts if you can
demonstrate that the deliverables will be amenable to multiple uses.
Step 4: Identify the resources available to you and any relevant standards
There are few substitutes for good local advice and support, so consult widely at your host
institution as well as contacting bodies like the AHDS (http://ahds.ac.uk). Remember that for straightforward
tasks such as scanning, OCR, or copy-typing, it may be more cost-effective to employ graduate student labour
on an hourly basis than to sub-contract the work to a commercial service or to employ a Research Assistant.
Technical skills date rapidly, and it is rarely worth acquiring them yourself unless they will become central to
your work and you are prepared to update them regularly.
Whenever possible, you should aim to use open or de facto standards as this is the best way to
increase the chances that your digital resource(s) will remain viable in the long term.
Step 5: Develop a project plan
Any electronic text creation project is at the mercy of the technology involved, so careful
planning is the key to minimising hold-ups. Consider scheduling a piloting and testing phase to help you resolve
most of the procedural and technical problems. You should also build in a mechanism for on-going quality
control and checking, as mistakes in digital data can be very expensive to correct retrospectively. You should
document all the key decisions and actions at every stage in the project, and ensure that any metadata records
are kept up-to-date and complete.
Step 6: Do the work!
If you have prepared well and carried out each of the previous steps, then this should be the
most straightforward phase of the entire project.
Step 7: Check the results
If you have been conducting quality control checks throughout the data creation process, then
this step should reveal few surprises. However, if absolute fidelity to the original source is of fundamental
importance to your work, it may be worthwhile investing in a separate programme of proof-reading. Simple
checks to ensure that you have captured all your original sources, and that your data have been prepared and
organised as you intended, can identify potentially costly mistakes which are easy to overlook. For example, if
you are creating a series of digital images to create a facsimile edition of a printed work, ensure that any
sequencing of the images matches the pagination of the original analogue source. Similarly, if you are
conducting a computer-assisted analysis of a transcribed text, the omission of a small but vital section could
affect the validity of any results.
Step 8: Test your text
Whether your aim was to produce a data source for secondary analysis, an electronic edition for
use by others, or something else entirely, you will need to ensure that what you have produced is actually
fit for its intended purpose. You may find that by sharing your work with others, you will gain valuable advice
and guidance upon how the resource could be improved or developed to meet the needs of fellow researchers,
teachers, and learners. Such sharing can be a frustrating process, especially if other people fail to appreciate
why you undertook the work in the first place, but often such feedback can dramatically improve the quality
and (re-)usability of a resource, for relatively little extra effort.
Step 9: Prepare for preservation, maintenance, and updating
[Ideally, you should have prepared for this step as part of developing your project plan (Step 5)].
If you have adopted open or de facto standards, then the preservation and maintenance of your data should
present few surprises. If you are depositing your data with another agency (such as the AHDS), or another
part of your institution (e.g. library services), then by following good practice in data creation and
documentation you will have created an electronic resource with excellent prospects for long-term viability.
Updating your data and/or the resulting resource raises several different issues: from technical
matters of version control and how best to indicate to other users that the data/resource may have changed
since last used, to possible sources of continuation funding.
Step 10: Review and share what you have learned
This can be an extremely valuable exercise, which can inform not only your own work and any
future funding bids that you might make, but also those of colleagues working in the same (or comparable)
discipline areas. There are several ways to disseminate information about your experiences, with a number of
humanities computing journals, conferences, and agencies (such as the AHDS and JISC), being keen to ensure
that lessons learned from practical experience are shared throughout the community.
Bibliography
ADOBE SYSTEMS. Adobe PostScript Overview [online]. Available from:
http://www.adobe.com/print/postscript/main.html [Accessed 12 Nov 1999].
ADOBE SYSTEMS. Adobe PostScript Licensees and Development Partners [online]. Available from:
http://www.adobe.com/print/postscript/oemlist.html [Accessed 12 Nov 1999].
AMERICAN LIBRARY ASSOCIATION (ALA), 1998. Committee on Cataloging: Description and Access Task
Force on Metadata and the Cataloging Rules Final Report [online]. Available from:
http://www.ala.org/alcts/organization/ccs/ccda/tf-tei2.html [Accessed 12 Nov 1999].
APEX DATA SERVICES, INC. Data Conversion Services [online]. Available
from: http://www.apexinc.com/dcs/dcs_index.html [Accessed 12 Nov 1999].
BURNARD, L.D., AND SPERBERG-MCQUEEN, C.M., 1995. TEI Lite: An Introduction to Text Encoding for
Interchange (TEI U5). Available from: http://www.hcu.ox.ac.uk/TEI/Lite/ [Accessed 12 Nov 1999].
CAERE OMNIPAGE. OmniPage Pro 10: Product Factsheet [online]. Available from:
http://www.caere.com/products/omnipage/pro/factsheet.asp [Accessed 12 Nov 1999].
COVER, R. The SGML/XML Web Page [online]. Available from: http://www.oasis-open.org/cover/ [Accessed 12
Nov 1999].
DAY, M., 1997. Extending Metadata for Digital Preservation. Ariadne [online], 9. Available from:
http://www.ariadne.ac.uk/issue9/metadata/ [Accessed 12 Nov 1999].
GASKELL, P., 1995. A New Introduction to Bibliography. Delaware: Oak Knoll Press.
GOLDFARB, C. F., 1990. The SGML Handbook. Oxford: Oxford University Press.
GROVES, P.J. AND LEE, S.D.,1999. 'On-Line Tutorials and Digital Archives' or 'Digitising Wilfred' [online].
Available from: http://info.ox.ac.uk/jtap/reports/index.html [Accessed 12 Nov 1999].
HEERY, R., POWELL, A., AND DAY, M., 1997. Metadata. Library and Information Briefings, 75, 1-19.
HEWLETT-PACKARD. Choosing a Scanner [online]. Available from:
http://www.scanjet.hp.com/shopping/list.htm [Accessed 12 Nov 1999].
LEE, S.D., 1999. Scoping the Future of Oxford's Digital Collections [online]. Available from:
http://www.bodley.ox.ac.uk/scoping/ [Accessed 12 Nov 1999].
LIBRARY OF CONGRESS. American Memory Project and National Digital Library Program [online]. Available
from: http://lcweb2.loc.gov/ [Accessed 11 Nov 1999].
LYNCH, C., 1997. Searching the Internet. Scientific American [online]. Available from:
http://www.sciam.com/0397issue/0397lynch.html [Accessed 12 Nov 1999].
MILLER, P., 1996. Metadata for the Masses. Ariadne [online], 5. Available from:
http://www.ariadne.ac.uk/issue5/metadata-masses/ [Accessed 12 Nov 1999].
MILLER, P., AND GREENSTEIN, D., 1997. Discovering Online Resources Across the Humanities: A Practical
Implementation of the Dublin Core [online]. Bath: UKOLN. Available from:
http://ahds.ac.uk/public/metadata/discovery.html [Accessed 12 Nov 1999].
OCLC. Cataloging Internet Resources: A Manual and Practical Guide (Second Edition) (N.B. Olson, ed.)
[online]. Available from: http://www.purl.org/oclc/cataloging-internet [Accessed 12 Nov 1999].
OCLC. Dublin Core Metadata Initiative [online]. Available from: http://purl.oclc.org/dc/ [Accessed 12 Nov
1999].
PEPPER, S. The Whirlwind Guide to SGML & XML Tools and Vendors [online]. Available
from: http://www.infotek.no/sgmltool/guide.htm [Accessed 12 Nov 1999].
ROBINSON, P., 1993. The Digitization of Primary Textual Sources. Oxford: Office for Humanities
Communication Publications.
SEAMAN, D., 1994. Campus Publishing in Standardized Electronic Formats: HTML and TEI [online]. Available
from: http://etext.lib.virginia.edu/articles/arl/dms-arl94.html [Accessed 12 Nov 1999].
SHILLINGSBURG, P.L., 1996. Scholarly Editing in the Computer Age: Theory and Practice. 3rd ed. Ann Arbor:
University of Michigan Press.
SPERBERG-MCQUEEN, C.M., AND BURNARD, L.D. (eds) 1994 (revised 1999). Guidelines for Electronic Text
Encoding and Interchange [online]. Available from: http://www.hcu.ox.ac.uk/TEI/P4beta/ [Accessed 12 Nov
1999].
SPERBERG-MCQUEEN, C.M., AND BURNARD, L.D., 1995a. The Design of the TEI Encoding Scheme. Computers
and the Humanities 29, 17-39.
TANSELLE, G.T., 1981. Recent Editorial Discussion and the Central Questions of Editing. Studies in
Bibliography 34, 23-65.
TEXT ENCODING INITIATIVE (TEI), 1987. The Poughkeepsie Principles: The Preparation of Text Encoding
Guidelines [online]. Available from: http://www-tei.uic.edu/orgs/tei/info/pcp1.html [Accessed 12 Nov 1999].
TEXT ENCODING INITIATIVE (TEI). The Pizza Chef: a TEI Tag Set Selector [online]. Available from:
http://www.hcu.ox.ac.uk/TEI/newpizza.html [Accessed 12 Nov 1999].
TEXT ENCODING INITIATIVE (TEI). The TEI Consortium Homepage [online]. Available from: http://www.tei-c.org/
[Accessed 12 Nov 1999].
UKOLN. Metadata [online]. Available from: http://www.ukoln.ac.uk/metadata/ [Accessed 12 Nov 1999].
UNIVERSITY OF VIRGINIA EARLY AMERICAN FICTION PROJECT [online]. Available from:
http://etext.lib.virginia.edu/eaf/intro.html [Accessed 12 Nov 1999].
UNIVERSITY OF VIRGINIA ELECTRONIC TEXT CENTER. Archival Digital Image Creation [online]. Available
from: http://etext.lib.virginia.edu/helpsheets/specscan.html [Accessed 12 Nov 1999].
UNIVERSITY OF VIRGINIA ELECTRONIC TEXT CENTER. Image Scanning: A Basic Helpsheet [online].
Available from: http://etext.lib.virginia.edu/helpsheets/scanimage.html [Accessed 12 Nov 1999].
W3C. HyperText Markup Language Home Page [online]. Available from: http://www.w3.org/MarkUp/ [Accessed
12 Nov 1999].
W3C. Extensible Markup Language (XML) 1.0 [online]. Available from: http://www.w3.org/TR/REC-xml
[Accessed 12 Nov 1999].
W3C. Extensible Stylesheet Language (XSL) Specification [online]. Available from:
http://www.w3.org/TR/WD-xsl/ [Accessed 12 Nov 1999].
W3C. XHTML 1.0: The Extensible HyperText Markup Language [online]. Available from:
http://www.w3.org/TR/xhtml1 [Accessed 12 Nov 1999].
W3C. XML Schema [online]. Available from: http://www.w3.org/TR/xmlschema-1/ [Accessed 12 Nov 1999].
WILFRED OWEN ARCHIVE [online]. Available from: http://www.hcu.ox.ac.uk/jtap/ [Accessed 12 Nov 1999].
YALE UNIVERSITY LIBRARY PROJECT OPEN BOOK [online]. Available from:
http://www.library.yale.edu/preservation/pobweb.htm [Accessed 12 Nov 1999].
Glossary
AACR2 Anglo-American Cataloguing Rules (2nd ed., 1988 Revision). Rules used in the USA and UK which define
the procedure for creating MARC records.
AHDS The Arts and Humanities Data Service. Online: http://ahds.ac.uk/
ASCII American Standard Code for Information Interchange, sometimes also referred to as 'plain text'.
Essentially the basic character set, with minimal formatting (i.e. without changes in font, font size, use of
italics etc.)
Corpus (pl. Corpora) Informally, a collection of data (e.g. whole texts or extracts, transcribed
conversations etc.) selected and organised according to certain principles. For example, a literary corpus
might consist of all the prose works of a particular author, while a linguistic corpus might consist of all the
forms of Russian verbs or examples of conversations amongst British English dialect speakers.
DESIRE Development of a European Service for Information on Research and Education. Online:
http://www.desire.org/
Digitize The process by which a non-digital (i.e. analogue) source is rendered in machine-readable form. Most
often used to describe the process of scanning a text or image using specialist hardware, to create
machine-readable data which can be manipulated by another application (e.g. OCR or image processing
software).
Document Analysis The task of examining the source object (usually a non-electronic text), in order to
acquire an understanding of the work being digitized and what the purpose and future of the project
entails. Document analysis is all about definition: defining the document context, defining the document
type, and defining the different document features and relationships. Usually, document analysis should
comprise the first step in any electronic text creation project, and requires users to become intimately
acquainted with the format, structure, and content of their source material.
DTD Document Type Definition. Rules, determined by an application, that apply SGML or XML to the markup
of documents of a particular type.
Dublin Core A metadata element set intended to facilitate discovery of electronic resources.
EAD Encoded Archival Description Document Type Definition (EAD DTD). A non-proprietary encoding
standard for machine-readable finding aids such as inventories, registers, indexes, and other documents
created by archives, libraries, museums, and manuscript repositories to support the use of their holdings.
Online:http://lcweb.loc.gov/ead/
GIF Graphic Interchange Format. GIF files use an older format that is limited to 256 colours. Like TIFFs,
GIFs use a lossless compression format but without requiring as much storage space. While they do not
have the compression capabilities of JPEG, they are strong candidates for graphic art and line drawings.
They also have the capability to be made into transparent GIFs, meaning that the background of the
image can be rendered invisible, thereby allowing it to blend in with the background of the web page.
HTML HyperText Markup Language is a non-proprietary format (based upon SGML) for publishing hypertext
on the World Wide Web. It has appeared in four main versions (1.0, 2.0, 3.2, and 4.0) although the World
Wide Web Consortium (W3C) recommends using HTML 4.0. Online: http://www.w3.org/
JPEG Joint Photographic Experts Group. JPEG files are the strongest format for web viewing, and for
transfer through systems with space restrictions. JPEGs are popular with image creators not only for their
compression capabilities but also for their quality. While TIFF is a lossless format, JPEG is a
lossy compression format. This means that as the file size is reduced, the image loses bits of information:
the information least likely to be noticed by the eye. The disadvantage of this format is precisely what
makes it so attractive: the lossy compression. Once an image is saved, the discarded information is lost.
The implication of this is that the entire image, or certain parts of it, cannot be enlarged. And the more
work done to the image, requiring it to be re-saved, the more information is lost. As there is no way to
retain all of the information scanned from the source, JPEGs are not recommended for archival storage.
Nevertheless, in terms of viewing capabilities and storage size, JPEGs are the best image file format for
online viewing.
MARC MAchine Readable Cataloguing record. Bibliographic record used by libraries which can be processed by
computers.
Markup (n.) Text that is added to the data of a document in order to convey information about it. There are
several kinds of markup, but the two most important are descriptive markup (often represented using
markup tags such as <TITLE>, </H1> etc.), and processing instructions (i.e. the internal instructions required
to change the appearance of a piece of data displayed on screen, start a new page when printing, indicate a
change in font etc.)
Mark up (vb.) To add markup.
Metadata Data about data. The additional information used to describe something for a particular purpose
(although that may not preclude its use for multiple purposes). For example, the 'Dublin Core' describes a
set of metadata intended to facilitate the discovery of electronic resources (see http://purl.org/dc/).
OCR Optical Character Recognition. OCR software attempts to recognise the characters on an image of a page
of text, and output a version of that text in machine-readable form. Modern OCR software can be trained
to recognise different fonts, and may use a dictionary to facilitate recognition of certain characters and
words. OCR works best with clean, modern, well-printed text.
OTA The Oxford Text Archive. Online: http://ota.ahds.ac.uk/
PDF Portable Document Format. The native proprietary file format of the Adobe Acrobat family of
products, intended to enable users to exchange and view electronic documents easily and reliably,
independent of the environment in which they were created. Online: http://www.adobe.com/
'Plain Text' See ASCII
PostScript Adobe PostScript is a computer language that describes the appearance of a page, including
elements such as text, graphics, and scanned images, to a printer or other output device. Online:
http://www.adobe.com/print/postscript/main.html
RDF The Resource Description Framework. A foundation for processing metadata; it provides interoperability
between applications that exchange machine-understandable information on the web.
ROADS A set of software tools to enable the set up and maintenance of web based subject gateways. Online
http://www.ilrt.bris.ac.uk/roads/
RTF Rich Text Format. A proprietary file format developed by Microsoft that describes the format and style
of a document (primarily for the purposes of interchange between different applications, most often
common word-processors). Online: http://www.microsoft.com/
SGML The Standard Generalized Markup Language. An International Standard (ISO8879) defining a language
for document representation that formalises markup and frees it of system and processing dependencies.
SGML is the language used to create DTDs. Online: http://www.oasis-open.org/cover/
TEI The Text Encoding Initiative is an international project which in May 1994 issued its Guidelines for the
Encoding and Interchange of Machine-Readable Texts. These Guidelines provide SGML encoding
conventions for describing the physical and logical structure of a large range of text types and features
relevant for research in language technology, the humanities, and computational linguistics. A revised
version of the Guidelines was released in 1999. Online: http://www.hcu.ox.ac.uk/TEI/P4beta/
TEI Lite An SGML DTD which represents a simplified subset of the recommendations set out in the TEI's
Guidelines for the Encoding and Interchange of Machine-Readable Texts. Online:
http://www.hcu.ox.ac.uk/TEI/Lite/
TeX/LaTeX A popular typesetting language (TeX) and a set of macro extensions (LaTeX), the latter being
designed to facilitate descriptive markup. Online: http://www.tug.org/
TIFF Tagged Image File Format. TIFF files are the most widely accepted format for archival image and
master copy creation. TIFFs retain all of the scanned image data, allowing you to gather as much
information as possible from the original. This is reflected in the one disadvantage of the TIFF image
(the file size), but any type of compression is strongly advised against. Any project that plans to archive
images or call them up for future modification should scan using this format.
UKOLN UK Office for Library and Information Networking. A national focus of expertise in network
information management, based at the University of Bath. Online: http://www.ukoln.ac.uk/
Unicode An industry profile of ISO 10646, the Unicode Worldwide Character Standard is a character coding
system designed to support the interchange, processing, and display of the written texts of the diverse
languages of the modern world. In addition, it supports classical and historical texts of many written
languages. Online: http://www.unicode.org/unicode/consortium/consort.html
XML The Extensible Markup Language is a data format for structured document interchange on the Web. The
current World Wide Web Consortium (W3C) Recommendation is XML 1.0, February 1998. Online:
http://www.w3.org/XML/