Creating and Documenting Electronic Texts: A Guide to Good Practice
by Alan Morrison, Michael Popham, and Karen Wikander

Contents

Chapter 1: Introduction
  1.1: Aims and organisation of this Guide
  1.2: What this Guide does not cover, and why
  1.3: Opening questions – Who will read your text, why, and how?
Chapter 2: Document Analysis
  2.1: What is document analysis?
  2.2: How should I start?
    2.2.1: Project objectives
    2.2.2: Document context
  2.3: Visual and structural analysis
  2.4: Typical textual features
Chapter 3: Digitization – Scanning, OCR, and Re-keying
  3.1: What is digitization?
  3.2: The digitization chain
  3.3: Scanning and image capture
    3.3.1: Hardware – Types of scanner and digital cameras
    3.3.2: Software
  3.4: Image capture and Optical Character Recognition (OCR)
    3.4.1: Imaging issues
    3.4.2: OCR issues
  3.5: Re-keying
Chapter 4: Markup: The key to reusability
  4.1: What is markup?
  4.2: Visual/presentational markup vs. structural/descriptive markup
    4.2.1: PostScript and Portable Document Format (PDF)
    4.2.2: HTML 4.0
    4.2.3: User-definable descriptive markup
  4.3: Implications for long-term preservation and reuse
Chapter 5: SGML/XML and TEI
  5.1: The Standard Generalized Markup Language (SGML)
    5.1.1: SGML as metalanguage
    5.1.2: The SGML Document
    5.1.3: Creating Valid SGML Documents
    5.1.4: XML: The Future for SGML
  5.2: The Text Encoding Initiative and TEI Guidelines
    5.2.1: A brief history of the TEI
    5.2.2: The TEI Guidelines and TEI Lite
  5.3: Where to find out more about SGML/XML and the TEI
Chapter 6: Documentation and Metadata
  6.1: What is Metadata and why is it important?
    6.1.1: Conclusion and current developments
  6.2: The TEI Header
    6.2.1: The TEI Lite Header Tag Set
    6.2.2: The TEI Header: Conclusion
  6.3: The Dublin Core Element Set and the Arts and Humanities Data Service
    6.3.1: Implementing the Dublin Core
    6.3.2: Conclusions and further reading
    6.3.3: The Dublin Core Elements
Chapter 7: Summary
  Step 1: Sort out the rights
  Step 2: Assess your material
  Step 3: Clarify your objectives
  Step 4: Identify the resources available to you and any relevant standards
  Step 5: Develop a project plan
  Step 6: Do the work!
  Step 7: Check the results
  Step 8: Test your text
  Step 9: Prepare for preservation, maintenance, and updating
  Step 10: Review and share what you have learned
Bibliography
Glossary
Chapter 1: Introduction

1.1: Aims and organisation of this Guide

The aim of this Guide is to take users through the basic steps involved in creating and documenting an electronic text or similar digital resource. The notion of 'electronic text' is interpreted very broadly, and discussion is not limited to any particular discipline, genre, language or period, although where space permits, issues that are especially relevant to these areas may be drawn to the reader's attention. The authors have tended to concentrate on those types of electronic text which, to a greater or lesser extent, represent a transcription (or, if you prefer, a 'rendition', or 'encoding') of a non-electronic source, rather than the category of electronic texts which are primarily composed of digitized images of a source text (e.g. digital facsimile editions). However, there are a growing number of electronic textual resources which support both these approaches; for example, some projects involving the digitization of rare illuminated manuscripts combine high-quality digital images (for those scholars interested in the appearance of the source) with electronic text transcriptions (for those scholars concerned with analysing aspects of the content of the source). We would hope that the creators of every type of electronic textual resource will find something of interest in this short work, especially if they are newcomers to this area of intellectual and academic endeavour.

This Guide assumes that the creators of electronic texts have a number of common concerns. For example, that they wish their efforts to remain viable and usable in the long term, and not to be unduly constrained by the limitations of current hardware and software. Similarly, that they wish others to be able to reuse their work, for the purposes of secondary analysis, extension, or adaptation. They also want the tools, techniques, and standards that they adopt to enable them to capture those aspects of any non-electronic sources which they consider to be significant, whilst at the same time being practical and cost-effective to implement.

The Guide is organised in a broadly linear fashion, following the sequence of actions and decisions which we would expect any electronic text creation project to undertake. Not every electronic text creator will need to consider every stage, but it may be useful to read the Guide through once, if only to establish the most appropriate course of action for one's own work.

1.2: What this Guide does not cover, and why

Creating and processing electronic texts was one of the earliest areas of computational activity, and has been going on for at least half a century.
This Guide does not have any pretence to be a comprehensive introduction to this complex area of digital resource creation, but the authors have attempted to highlight some of the fundamental issues which will need to be addressed, particularly by anyone working within the community of arts and humanities researchers, teachers, and learners who may never before have undertaken this kind of work.

Crucially, this Guide will not attempt to offer a comprehensive (or even a comparative) overview of the available hardware and software technologies which might form the basis of any electronic text creation project. This is largely because the development of new hardware and software continues at such a rapid pace that anything we might review or recommend here will probably have been superseded by the time this publication becomes available in printed form. Similarly, there would have been little point in providing detailed descriptions of how to combine particular encoding or markup schemes, metadata, and delivery systems, as the needs and abilities of the creators and (anticipated) users of an electronic text should be the major factors influencing its design, construction, and method of delivery. Instead, the authors have attempted to identify and discuss the underlying issues and key concerns, thereby helping readers to begin to develop their own knowledge and understanding of the whole subject of electronic text creation and publication. When combined with an intimate knowledge of the non-electronic source material, readers should be able to decide for themselves which approach, and thus which combinations of hardware and software, techniques and design philosophy, will be most appropriate to their needs and the needs of any other prospective users.

Although every functional aspect of computers is based upon the distinctive binary divide evidenced between 1s and 0s, true and false, presence and absence, it is rarely so easy to draw such clear distinctions at the higher levels of creating and documenting electronic texts. Therefore, whilst reading this Guide it is important to remember that there are seldom 'right' or 'wrong' ways to prepare an electronic text, although certain decisions will crucially affect the usefulness and likely long-term viability of the final resource. Readers should not assume that any course of action recommended here will necessarily be the 'best' approach in any or all given circumstances; however, everything the authors say is based upon our understanding of what constitutes good practice and results from almost twenty-five years of experience running the Oxford Text Archive (http://ota.ahds.ac.uk).

1.3: Opening questions – Who will read your text, why, and how?

There are some fundamental questions that will recur throughout this Guide, and all of them focus upon the intended readership (or users) of the electronic text that you are hoping to produce. For example, if your main reason for creating an electronic text is to provide the raw data for computer-assisted analysis, perhaps as part of an authorship attribution study, then completeness and accuracy of the data will probably be far more important than capturing the visual appearance of the source text.
Conversely, if you are hoping to produce an electronic text that will have broad functionality and appeal, and the original source contains presentational features which might be considered worthy of note, then you should be attempting to create a very different object, perhaps one where visual fidelity is more important than the absolute accuracy of any transcription. In the former case, the implicit assumption is that no-one is likely to read the electronic text (data) from start to finish, whilst in the second case it is more likely that some readers may wish to use the electronic text as a digital surrogate for the original work. As the nature of the source(s) and/or the intended resource(s) becomes more complex, for example recording variant readings of a manuscript or discrepancies between different editions of the same printed text, the same fundamental questions remain. The next chapter of this Guide looks at how you might start to address some of these questions, by subjecting your source(s) to a process that the creators of electronic texts have come to call 'Document Analysis'.

Chapter 2: Document Analysis

2.1: What is document analysis?

Deciding to create an electronic text is just like deciding to begin any other type of construction project. While the desire to dive right in and begin building is tempting, any worthwhile endeavour will begin with a thorough planning stage. In the case of digitized text creation, this stage is called document analysis. Document analysis is literally the task of examining the physical object in order to acquire an understanding of the work being digitized and to decide what the purpose and future of the project entails. The digitization of texts is not simply making groups of words available to an online community; it involves the creation of an entirely new object. This is why achieving a sense of what it is that you are creating is critical. The blueprint for construction will allow you to define the foundation of the project. It will also allow you to recognise any problems or issues that have the potential to derail the project at a later point.

Document analysis is all about definition: defining the document context, defining the document type, and defining the different document features and relationships. At no other point in the project will you have the opportunity to spend as much quality time with your document. This is when you need to become intimately acquainted with the format, structure, and content of the texts.

Document analysis is not limited to physical texts, but as the goal of this guide is to advise on the creation of digital texts from the physical object, this will be the focus of the chapter. For discussions of document analysis on objects other than text, please refer to such studies as Yale University Library's Project Open Book (http://www.library.yale.edu/preservation/pobweb.htm), the Library of Congress American Memory Project and National Digital Library Program (http://lcweb2.loc.gov/), and Scoping the Future of Oxford's Digital Collections (http://www.bodley.ox.ac.uk/scoping/).

2.2: How should I start?

2.2.1: Project objectives

One of the first tasks to perform in document analysis is to define the goals of the project and the context under which they are being developed. This could be seen as one of the more difficult tasks in the document analysis procedure, as it relies less upon the physical analysis of the document and more upon the theoretical positions taken with the project.
This is the stage where you need to ask yourself why the document is being encoded. Are you looking simply to preserve a digitized copy of the document in a format that will allow an almost exact future replication? Is your goal to encode the document in a way that will assist in a linguistic analysis of the work? Or perhaps there will be a combination of structural and thematic encoding, so that users will be able to perform full-text searches of the document? Regardless of the choice made, the project objectives must be carefully defined, as all subsequent decisions hinge upon them.

It is also important to take into consideration the external influences on the project. Often the bodies that oversee digitization projects, either in a funding or advisory capacity, have specific conditions that must be fulfilled. They might, for example, have markup requirements or standards (linguistic, TEI/SGML, or EAD perhaps) that must be taken into account when establishing an encoding methodology. Also, if you are creating the electronic text for scholarly purposes, then it is very likely that the standards of this community will need to be adhered to. Again, it must be remembered that the electronic version of a text is a distinct object and must be treated as such. Just as you would adhere to a publishing standard of practice with a printed text, so must you follow the standard for electronic texts. The most stringent scholarly community, the textual critics and bibliographers, will have specific, established guidelines that must be considered in order to gain the requisite scholarly authority. Therefore, if you were creating a text to be used or approved by this community, their criteria would have to be integrated into the project standards, with the subsequent influence on both the objectives and the creative process taken into account. If the digitization project includes image formats, then there are specific archiving standards held by the electronic community that might have to be met; this will not only influence the purchase of hardware and software, but will have an impact on the way in which the electronic object will finally be structured. External conditions are easily overlooked during the detailed analysis of the physical object, so be sure that the standards and policies that influence the outcome of the project are given serious thought, as having to modify the documents retrospectively can prove both detrimental and expensive.

This is also a good time to evaluate who the users of your project are likely to be. While you might have personal goals to achieve with the project, perhaps a level of encoding that relates to your own area of expertise, many of the objectives will relate to your user base. Do you see the work being read by secondary school pupils? Undergraduates? Academics? The general public? Be prepared for the fact that every user will want something different from your text. While you cannot satisfy each desire, trying to evaluate what information might be the most important to your audience will allow you to address the needs and concerns you deem most appropriate and necessary. Also, if there are specific objectives that you wish users to derive from the project, then this too needs to be established at the outset. If the primary purpose for the texts is as a teaching mechanism, then this will have a significant influence on how you choose to encode the document.
Conversely, if your texts are being digitized so that users will be able to perform complex thematic searches, then both the markup of content and the content of the markup will differ somewhat. Regardless of the decision, be sure that the outcome of this evaluation becomes integrated with the previously determined project objectives.

You must also attempt to assess what tools users will have at their disposal to retrieve your document. The hardware and software capabilities of your users will differ, sometimes dramatically, and will most likely present some sort of restriction or limitation upon their ability to access your project. SGML-encoded text requires the use of specialised software, such as Panorama, to read the work. Even HTML has tagsets that early browsers may not be able to read. It is essential that you take these variants into consideration during the planning stage. There might be priorities in the project that require accessibility for all users, which would affect the methodology of the project. However, don't let the user limitations stunt the encoding goals for the document. Hardware and software are constantly being upgraded, so that although some of the encoding objectives might not be fully functional during the initial stages of the project, they stand a good chance of becoming accessible in the near future.

2.2.2: Document context

The first stage of document analysis is not only necessary for detailing the goals and objectives of the project, but also serves as an opportunity to examine the context of the document. This is a time to gather as much information as possible about the documents being digitized. The amount gathered varies from project to project, but in an ideal situation you will have a complete transmission and publication history for the document. There are a few key reasons for this.

Firstly, knowing how the object being encoded was created will allow you to understand any textual variations or anomalies. This, in turn, will assist in making informed encoding decisions at later points in the project. The difference between a printer error and an authorial variation not only affects the content of the document, but also the way in which it is marked up.

Secondly, the depth of information gathered will give the document the authority desired by the scholarly community. A text about which little is known can only be used with much hesitation. While some users might find it more than acceptable for simply printing out or reading, there can be no authoritative scholarly analysis performed on a text with no background history.

Thirdly, a quality electronic text will have a TEI header attached (see Chapter 6). The TEI header records all the information about the electronic text's print source. The more information you know about the source, the more full and conclusive your header will be, which will again provide scholarly authority.

Lastly, understanding the history of the document will allow you to understand its physicality. The physicality of the text is an interesting issue and one on which very few scholars fully agree. Clearly, an understanding of the physical object provides a sense of the format, necessary for a proper structural encoding of the text, but it also augments a contextual understanding.
Peter Shillingsburg theorises that the 'electronic medium has extended the textual world; it has not overthrown books nor the discipline of concentrated "lines" of thought; it has added dimensions and ease of mobility to our concepts of textuality' (Shillingsburg 1996, 164). How is this so? Simply put, the electronic medium will allow you to explore the relationships in and amongst your texts. While the physical object has trained readers to follow a more linear narrative, the electronic document will provide you with an opportunity to develop the variant branches found within the text. Depending upon the decided project objectives, you are free to highlight, augment, or furnish your users with as many different associations as you find significant in the text. Yet to do this, you must fully understand the ontology of the texts and then be able to delineate this textuality through the encoding of the computerised object.

It is important to remember that the transmission history does not end with the publication of the printed document. Tracking the creation of the electronic text, including the revision history, is a necessary element of the encoding process. The fluidity of electronic texts precludes the guarantee that every version of the document will remain in existence, so the responsibility lies with the project creator to ensure that all revisions and developments are noted. While some of the documentation might seem tedious, an electronic transmission history will serve two primary purposes. One, it will help keep the project creator(s) aware of what has developed in the creation of the electronic text. If there are quite a few staff members working on the documents, you will be able to keep track of what has been accomplished with the texts and to check that the project methodology is being followed. Two, users of the documents will be able to see what emendations or regularisations have been made and to track what the various stages of the electronic object were. Again, this will prove useful to a scholarly community, like the textual critics, whose research is grounded in the idea of textual transmission and history.

2.3: Visual and structural analysis

Once the project objectives and document context have been established, you can move on to an analysis of the physical object. The first step is to provide the source texts with a classification. Defining the document type is a critical part of the digitization process as it establishes the foundation for the initial understanding of the text's structure. At this point you should have an idea of what documents are going to be digitized for the project. Even if you are not sure precisely how many texts will be in the final project, it is important to have a representative sample of the types of documents being digitized. Examine the sample documents and decide what categories they fall under. The structure and content of a letter will differ greatly from that of a novel or poem, so it is critical to make these naming classifications early in the process. Not only are there structural differences between varying document types but also within the same type. One novel might consist solely of prose, while another might be comprised of prose and images, while yet another might have letters and poetry scattered throughout the prose narrative. Having an honest representative sample will provide you with the structural information needed to make fundamental encoding decisions.
Deciding upon document type will give you an initial sense of the shape of the text. There are basic structural assumptions that come with classification: looking for the stanzas in poetry or the paragraphs in prose, for example. Having established the document type, you can begin to assign the texts a more detailed structure. Without worrying about the actual tag names, as this comes later in the process, label all of the features you wish to encode. For example, if you are digitizing a novel, you might initially break it into large structural units: title page, table of contents, preface, body, back matter, etc. Once this is done you might move on to smaller features: titles, heads, paragraphs, catchwords, pagination, plates, annotations, and so forth. One way to keep the naming in perspective is to create a structure outline. This will allow you to see how the structure of your document is developing, whether you have omitted any necessary features, or if you have labelled too much.

Once the features to be encoded have been decided upon, the relationships between them can then be examined. Establishing the hierarchical sequence of the document should not be too arduous a task, especially if you have already developed a structural outline. It should at this point be apparent, if we stick with the example of a novel, that the work is contained within front matter, body matter, and back matter. Within front matter we find such things as epigraphs, prologues, and title pages. The body matter is comprised of chapters, which are constructed with paragraphs. Within the paragraphs can be found quotations, figures, and notes. This is an established and understandable hierarchy. There is also a sequential relationship where one element logically follows another. Using the above representation, if every body has chapters, paragraphs, and notes, then you would expect to find a sequence of <chapter> then <paragraph> then <note>, not <chapter>, <note>, then <paragraph>. Again, the more you understand about the type of text you are encoding, the easier this process will be.

While the level of structural encoding will ultimately depend upon the project objectives, this is an opportune time to explore the form of the text in as much detail as possible. Having these data will influence later encoding decisions, and being able to refer to these results will be much easier than having to sift through the physical object at a later point to resolve a structural dilemma. The analysis also brings to light any issues or problems with the physical document. Are parts of the source missing? Perhaps the text has been water damaged and certain lines are unreadable? If the document is a manuscript or letter, perhaps the writing is illegible? These are all instances that can be explored at an early stage of the project. While these problems will add a level of complexity to the encoding project, they must be dealt with in an honest fashion. If the words of a letter are illegible and you insert text that represents your best guess at the actual wording, then this needs to be encoded. The beauty of document analysis is that by examining the documents prior to digitization you stand a good chance of recognising these issues and establishing an encoding methodology.
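By way of illustration only, a fragment of such a structure outline for a novel might eventually be expressed in markup along the following lines. The element names here (frontMatter, chapter, unclear, and so on) are invented placeholders rather than a recommended tag set; standard schemes such as the TEI, introduced in Chapter 5, provide equivalents for most of them. Note how the fragment records both the hierarchy and the expected sequence described above, and marks explicitly the point at which the encoder has supplied a best guess for an illegible word:

  <novel>
    <frontMatter>
      <titlePage/>
      <epigraph/>
      <prologue/>
    </frontMatter>
    <bodyMatter>
      <chapter n="1">
        <paragraph>Opening paragraph of the chapter ...</paragraph>
        <paragraph>A passage in which one
          <unclear resp="encoder">word</unclear> is the encoder's best guess ...</paragraph>
        <note>A note following the paragraphs it relates to.</note>
      </chapter>
    </bodyMatter>
    <backMatter/>
  </novel>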
The benefit of this is threefold: firstly, having identified and dealt with this problem at the start, you will have fewer issues arise during the digitization process; secondly, there will be an added level of consistency during the encoding stage and retrospective revision won't be necessary; thirdly, the project will benefit from the thorough level of accuracy desired and expected by the scholarly community.

This is also a good time to examine the physical document and attempt to anticipate problems with the digitization process. Fragile spines, flaking or foxed paper, badly inked text: all will create difficulties during the scanning process and increase the likelihood of project delays if not anticipated at an early stage. This is another situation that requires examining representative samples of texts. It could be that one text was cared for in the immaculate conditions of a Special Collections facility while another was stored in a damp corner of a bookshelf. You need to be prepared for as many document contingencies as possible. Problems not only arise out of the condition of the physical object, but also out of such things as typography. OCR digitization is heavily reliant upon the quality and type of fonts used in the text. As will be discussed in greater detail in Chapter 3, OCR software is optimised for laser-quality printed text. This means that the older the printed text, the more degradation in the scanning results. These types of problems are critical to identify, as decisions will have to be made about how to deal with them, decisions that will become a significant part of the project methodology.

2.4: Typical textual features

The final stage of document analysis is deciding which features of the text to encode. Once again, knowing the goals and objectives of the project will be of great use as you try to establish the breadth of your element definition. You have control over how much of the document you want to encode, taking into account how much time and manpower are dedicated to the project. Once you've made a decision about the level of encoding that will go into the project, you need to make the practical decision of what to tag. There are three basic categories to consider: structure, format, and content.

In terms of structure there are quite a few typical elements that are encoded. This is a good time to examine the structural outline to determine what skeletal features need to be marked up. In most cases, the primary divisions of text (chapters, sections, stanzas, etc.) and the supplementary parts (paragraphs, lines, pages) are all assigned tag names. With structural markup, it is helpful to know how detailed an encoding methodology is being followed. As you will discover, you can encode almost anything in a document, so it will be important to have established what level of markup is necessary and to then adhere to those boundaries.

The second step is to analyse the format of the document. What appearance-based features need to translate between the print and electronic objects? Some of the common elements relate to attributes such as bold, italic, and typeface. Then there are other aspects that take a bit more thought, such as special characters. These require special tags or entity references, for example &AElig; for Æ. However, cases do exist of characters which cannot be encoded, and alternate provisions must be made. Format issues also include notes and annotations (items that figure heavily in scholarly texts), marginal glosses, and indentations.
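To make the special-character point concrete, in an SGML or XML transcription the ligature would normally travel as an entity reference rather than as a raw character. The fragment below is purely illustrative, not a prescribed encoding:

  <paragraph>&AElig;thelred was crowned king in 978.</paragraph>

Where the entity is not already supplied by a standard set such as ISOLAT1, it can (in XML, for instance) be declared explicitly in the document type declaration, mapping the name onto the relevant character code:

  <!ENTITY AElig "&#198;">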
Elements of format are easily forgotten, so be sure to go through the representative documents and choose the visual aspects of the text that must be carried through to the electronic object.

The third encoding feature concerns document content. This is where you will go through the document looking for features that are neither structural nor format based. This is the point where you can highlight the content information necessary to the text and the user. Refer back to the decisions made about textual relationships and what themes and ideas should be highlighted. If, for example, you are creating a database of author biographies, you might want to encode such features as the author's name, place of birth, written works, spouse, etc. Having a clear sense of the likely users of the project will make these decisions easier and perhaps more straightforward. This is also a good time to evaluate what the methodology will be for dealing with textual revisions, deletions, and additions, either authorial or editorial. Again, it is not so critical here to define what element tags you are using but rather to arrive at a listing of features that need to be encoded.
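Purely by way of illustration, a single entry in the hypothetical author-biography database imagined above might eventually be marked up along these lines (the element names are invented for this example and are not drawn from any standard tag set):

  <authorEntry>
    <name>Mary Shelley</name>
    <birthPlace>London</birthPlace>
    <birthDate>1797</birthDate>
    <spouse>Percy Bysshe Shelley</spouse>
    <works>
      <work date="1818">Frankenstein; or, The Modern Prometheus</work>
    </works>
  </authorEntry>

At the document analysis stage, however, it is the list of content features (name, place of birth, works, and so on) that matters, not the eventual tag names.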
Once these steps have been taken you are ready to move on to the digitization process.

Chapter 3: Digitization – Scanning, OCR, and Re-keying

3.1: What is digitization?

Digitization is quite simply the creation of a computerised representation of a printed analog. There are many methods of digitizing and varied media to be digitized. However, as this guide is concerned with the creation of electronic texts, it will focus primarily on text and images, as these are the main objects in the digitization process. This chapter will address such issues as scanning and image capture, necessary hardware and software concerns, and a more lengthy discussion of digitizing text. For discussions of digitizing other formats, audio and video for example, there are many thorough analyses of procedure. Peter Robinson's The Digitization of Primary Textual Sources covers most aspects of the decision-making process and gives detailed explanations of all formats. 'On-Line Tutorials and Digital Archives' (or 'Digitising Wilfred'), written by Dr Stuart Lee and Paul Groves, is the final report of their JTAP Virtual Seminars project and takes you step by step through the process and how the various digitization decisions were made. They have also included many helpful worksheets to help scope and cost your own project. For a more current study of the digitization endeavour, refer to Stuart Lee's Scoping the Future of Oxford's Digital Collections at http://www.bodley.ox.ac.uk/scoping, which examined Oxford's current and future digitization projects. Appendix E of the study provides recommendations applicable to those outside of the Oxford community by detailing the fundamental issues encountered in digitization projects.

While the above reports are extremely useful in laying out the steps of the digitization process, they suffer from the inescapable liability of being tied to the period in which they were written. In other words, recommendations for digitizing are constantly changing. As hardware and software develop, so does the quality of digitized output. The price cuts in storage costs allow smaller projects to take advantage of archival imaging standards (discussed below). This in no way detracts from the importance of the studies produced by scholars such as Lee, Groves, and Robinson; it simply acknowledges that the fluctuating state of digitization must be taken into consideration when project planning. Keeping this in mind, the following sections will attempt to cover the fundamental issues of digitization without focusing on ephemeral discussion points.

3.2: The digitization chain

The digitization chain is a concept expounded by Peter Robinson in his aforementioned publication. The idea is based upon the fundamental concept that the best quality image will result from digitizing the original object. If this is not an attainable goal, then digitization should be attempted with as few steps removed from the original as possible. Therefore, the chain is composed of the number of intermediates that come between the original object and the digital image: the more intermediates, the more links in the chain (Robinson 1993).

This idea was then extended by Dr Lee so that the digitization chain became a circle in which every step of the project became a separate link. Each link attains a level of importance so that if one piece of the chain were to break, the entire project would fail (Groves and Lee 1999). While this is a useful concept in project development, it takes us away from the object of this chapter (digitization), so we'll lean more towards Robinson's concept of the digitization chain. As will soon become apparent with the discussion of imaging hardware and software, having very few links in the digitization chain will make the project flow more smoothly.

Regardless of the technology utilised by the project, the results will depend, first and foremost, on the quality of the image being scanned. Scanning a copy of a microfilm of an illustration originally found in a journal is acceptable if it is the only option you have, but clearly scanning the image straight from the journal itself is going to make an immeasurable difference in quality. This is one important reason for carefully choosing the hardware and software. If you know that you are dealing with fragile manuscripts that cannot handle the damaging light of a flatbed scanner, or a book whose binding cannot open past a certain degree, then you will probably lean towards a digital camera. If you have text that is from an 18th-century book, with fading pages and uneven type, you will want the best text scanning software available. Knowing where your documents stand in the digitization chain will influence the subsequent imaging decisions you will make for the project.

3.3: Scanning and image capture

The first step in digitization, both text and image, is to obtain a workable facsimile of the page. To accomplish this you will need a combination of hardware and software imaging tools. This is a somewhat difficult area to address in terms of recommending specific product brands, as what is considered industry (or at least the text creation industry) standard is subject to change as technology develops. However, this chapter will discuss some of the hardware and software frequently used by archives and digital project creators.

3.3.1: Hardware – Types of scanner and digital cameras

There are quite a few methods of image capture that are used within the humanities community. The equipment ranges from scanners (flatbed, sheetfed, drum, slide, microfilm) to high-end digital cameras. In terms of standards within the digitizing community, the results are less than satisfactory.
Projects tend to choose the most available option, or the one that is affordable on limited grant funding. However, two of the most common and accessible image capture solutions are flatbed scanners and high-resolution digital cameras.

Flatbed scanners

Flatbed scanners have become the most commonplace method for capturing images or text. Their name comes from the fact that the scanner is literally a flat glass bed, quite similar to a copy machine, on which the image is placed face down and covered. The scanner then passes light-sensitive sensors over the illuminated page, breaking it into groups of pixel-sized boxes. It then represents each box with a zero or a one, depending on whether the pixel is filled or empty. The importance of this becomes more apparent with the discussion of image type below.

As a result of their lowering costs and widespread availability, the use of quality flatbeds ranges from professional digital archiving projects to the living rooms of home computer consumers. One benefit of this increased use and availability is that flatbed scanning technology is evolving continually. This has pushed the purchasing standards away from price and towards quality. In an attempt to promote the more expensive product, the marketplace tends to hype resolution and bit-depth, two aspects of scanning that are important to a project (see section 3.4) but are not the only concerns when purchasing hardware. While it is not necessarily the case that you need to purchase the most expensive scanner to get the best quality digital image, it is unlikely that the entry-level flatbeds (usually under 100 pounds/dollars) will provide the image quality that you need. However, while it used to be the case that to truly digitize well you needed to purchase a more high-end scanner, at a price prohibitive to most projects, the advancing digitizing needs of users have pushed hardware developers to create mid-level scanners that reach the quality of the higher range.

As a consumer, you need to possess a holistic view of the scanner's capabilities. Not only should the scanner provide you with the ability to create archival quality images (discussed in section 3.4.2) but it should also make the digitization process easier. Many low-cost scanners do not have high-grade lenses, optics, or light sources, thereby creating images that are of a very poor quality. The creation of superior calibre images relates to the following hardware requirements (www.scanjet.hp.com/shopping/list.htm):

- the quality of the lens, mirrors, and other optics hardware;
- the mechanical stability of the optical system;
- the focal range and stability of the optical system;
- the quality of the scanning software;
- and many other hardware and software features.

Also, many of the better quality scanners contain tools that allow you to automate some of the procedures. This is extremely useful with such things as colour and contrast where, with the human eye, it is difficult to achieve the exact specification necessary for a high-quality image. Scanning hardware has the ability to provide this discernment for the user, so these intelligent automated features are a necessity to decrease task time.

Digital cameras

One of the disadvantages of a flatbed scanner is that to capture the entire image the document must lie completely flat on the scanning bed. With books this poses a problem, because the only way to accomplish this is to bend the spine to the breaking point.
It becomes even worse when dealing with texts with very fragile pages, as the inversion and pressure can cause the pages to flake away or rip. A solution to this problem, one taken up by many digital archives and special collections departments, is to digitize with a stand-alone digital camera. Digital cameras are by far the most dependable means of capturing quality digital images. As Robinson explains, 'They can digitize direct from the original, unlike the film-based methods of microfilm scanning or Photo CD. They can work with objects of any size or shape, under many different lights, unlike flatbed scanners. They can make images of very high resolution, unlike video cameras' (Robinson 1993, 39). These benefits are most clearly seen in the digitization of manuscripts and early printed books, objects that are difficult to capture on a flatbed because of their fragile composition. The ability to digitize with variant lighting is a significant benefit as it won't damage the make-up of the work, a precaution which cannot be guaranteed with flatbed scanners. The high resolution and heightened image quality allow for a level of detail you would expect only in the original. As a result of these specifications, images can be delivered at great size.

A good example of this is the Early American Fiction project being developed at UVA's Electronic Text Center and Special Collections Department (http://etext.lib.virginia.edu/eaf/intro.html). The Early American Fiction project, whose goal is the digitization of 560 volumes of American first editions held in the UVA Special Collections, is utilizing digital cameras mounted above light tables. They are working with camera backs manufactured by Phase One attached to Tarsia Technical Industries Prisma 45 4x5 cameras on TTI Reprographic Workstations. This has allowed them to create high quality images without damaging the physical objects. As they point out in their overview of the project, the workflow depends upon the text being scanned, but the results work out to close to one image every three minutes. While this might sound detrimental to the project timeline, it is relatively quick for an archival quality image. The images can be seen at such a high resolution that the faintest pencil annotations can be read on-screen. Referring back to Robinson's digitization chain (3.2), we can see how this ability to scan directly from the source object prevents the 'degradation' found in digitizing documents with multiple links between original and computer.

3.3.2: Software

Making specific recommendations for software programs is a problematic proposition. As has been stated often in this chapter, there are no agreed 'standards' for digitization. With software, as with hardware, the choices made vary from project to project depending upon personal choice, university recommendations, and often budgetary restrictions. However, there are a few tools that are commonly seen in use with many digitization projects. Regardless of the brand of software purchased, the project will need text scanning software if there is to be in-house digitization of text, and an image manipulation software package if imaging is to be done.

There are a wide variety of text scanning software packages available, all with varying capabilities. The intricacies of text scanning are discussed in greater detail below, but the primary consideration with any text scanning software is how well it works with the condition of the text being scanned.
As this software is optimised for laser-quality printouts, projects working with texts from earlier centuries need to find a package that has the ability to work through more complicated fonts and degraded page quality. While there is no standard, most projects work with Caere's OmniPage scanning software. In terms of image manipulation, there are more choices depending upon what needs to be done. For image-by-image manipulation, including converting TIFFs to web-deliverable JPEGs and GIFs, Adobe Photoshop is the more common selection. However, when there is a move towards batch conversion, DeBabelizer Pro is known for its speed and high quality. If the conversion is being done in a UNIX environment, the XV image viewer is also a favourite amongst digitization projects.

3.4: Image capture and Optical Character Recognition (OCR)

As discussed earlier, electronic text creation primarily involves the digitization of text and images. Apart from re-keying (which is discussed in 3.5), the best method of digitizing text is Optical Character Recognition (OCR). This process is accomplished through the utilisation of scanning hardware in conjunction with text scanning software. OCR takes a scanned image of a page and converts it into text. Similarly, image capture also requires image scanning software to accompany the hardware. However, unlike text scanning, image capture has more complex requirements in terms of project decisions and, like almost everything else in the digitization project, benefits from clearly thought out objectives.

3.4.1: Imaging issues

The first decision that must be made regarding image capture concerns the purpose of the images being created. Are the images simply for web delivery or are there preservation issues that must be considered? The reason for this is simple: the higher the quality the image needs to be, the higher the settings necessary for scanning. Once this decision has been made there are two essential image settings that must be established: what type of image will be scanned (greyscale? black and white? colour?) and at what resolution.

Image types

There are four main types of images: 1-bit black and white, 8-bit greyscale, 8-bit colour, and 24-bit colour. A bit is the fundamental unit of information read by the computer, with a single bit being represented by either a '0' or a '1'. A '0' is considered an absence and a '1' a presence, with more complex representations of information being accommodated by multiple or gathered bits (Robinson 1993, 100).

A 1-bit black and white image means that each bit can either be black or white. This is a rarely used type and is completely unsuitable for almost all images. The only amenable image for this format would be printed text or line graphics for which poor resulting quality did not matter. Another drawback of this type is that saving it as a JPEG compressed image, one of the most prevalent image formats on the web, is not a feasible option.

8-bit greyscale images are an improvement on 1-bit as they encompass 256 shades of grey. This type is often used for non-colour images (see the Wilfred Owen Archive at http://www.hcu.ox.ac.uk/jtap/) and provides a clear image rather than the resulting fuzz of a 1-bit scan. While greyscale images are often considered more than adequate, there are times when non-colour images should be scanned at a higher colour depth because the finite detail of the hand will come through distinctly (Robinson 1993, 28).
Also, the consistent recommendation is that images that are to be considered preservation or archival copies should be scanned as 24-bit colour.

8-bit colour is similar to 8-bit greyscale with the exception that each pixel can be one of 256 colours. The decision to use 8-bit colour is completely project dependent, as the format is appropriate for web page images but can come out somewhat grainy. Another factor is the type of computer the viewer is using, as older ones cannot handle an image above 8-bit, and so will convert a 24-bit image to the lower format. However, the factor to take into consideration here is primarily storage space. An 8-bit image, while not having the quality of a higher format, will be markedly smaller.

If possible, 24-bit colour is the best scanning choice. This option provides the highest quality image, with each pixel having the potential to contain one of 16.8 million colours. The arguments against this image format are the size, cost, and time necessary. Again, knowing the objectives of the project will assist in making this decision. If you are trying to create archival quality images, this is taken as the default setting. 24-bit colour makes the image look more photo-realistic, even if the original is greyscale. The thing to remember with archival quality imaging is that if you need to go back and manipulate the image in any way, it can be copied and adjusted. However, if you scan the image in a lesser format then any kind of retrospective adjustment will be impossible. While a 24-bit colour archived image can be made greyscale, an 8-bit greyscale image cannot be converted into millions of colours.

Resolution

The second concern relates to the resolution of the image. The resolution is determined by the number of dots per inch (dpi). The more dots per inch in the file, the more information is being stored about the image. Again, this choice is directly related to what is being done with the image. If the image is being archived or will need to be enlarged, then the resolution will need to be relatively higher. However, if the image is simply being placed on a web page, then the resolution drops drastically. As with the choices in image type, the dpi ranges alter the file size. The higher the dpi, the larger the file size. To illustrate the differences, I will replicate an informative table created by the Electronic Text Center, which examines an uncompressed 1" x 1" image in different types and resolutions.

  Resolution (dpi):            400x400   300x300   200x200   100x100
  1-bit black and white          20K       11K        5K        1K
  8-bit greyscale or colour     158K       89K       39K        9K
  24-bit colour                 475K      267K      118K       29K

Clearly the 400 dpi scan of a 24-bit colour image is going to have the largest file size, but it is also one of the best choices for archival imaging. The 100 dpi image is attractive not only for its small size, but because screen resolution rarely exceeds this amount. Therefore, as stated earlier, the dpi choice depends on the project objectives.
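The sizes in this table follow directly from the resolution and the bit depth. As a rough check (my own arithmetic rather than the Electronic Text Center's, ignoring file-format overhead and taking 1K as 1,024 bytes), the uncompressed size of a scan is approximately

\[
\text{size in bytes} \approx (\text{width in inches} \times \text{dpi}) \times (\text{height in inches} \times \text{dpi}) \times \frac{\text{bit depth}}{8}
\]

so a 1" x 1" original scanned at 300 dpi in 24-bit colour occupies about 300 x 300 x 3 = 270,000 bytes, or roughly 264K, which agrees with the 267K in the table once header overhead is allowed for.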
More so than the following formats, TIFF files can be read by almost all platforms, which also makes it the best choice when transferring important images. Most digitization projects begin image scanning with the TIFF format, as it allows you to gather as much information as possible from the original and then saves these data. This touches on the only disadvantage of the TIFF format: the size of the image. However, once the image is saved, it can be called up at any point and be read by a computer with a completely different hardware and software system. Also, if there exists any possibility that the images will be modified at some point in the future, then the images should be scanned as TIFFs.

JPEG (Joint Photographic Experts Group) files are the strongest format for web viewing and transfer through systems that have space restrictions. JPEGs are popular with image creators not only for their compression capabilities but also for their quality. While TIFF compression is lossless, JPEG is a lossy compression format. This means that as a filesize condenses, the image loses bits of information. However, this does not mean that the image will markedly decrease in quality. If the image is scanned at 24-bit, each dot has the choice of 16.8 million colours, more than the human eye can actually differentiate on the screen. So with the compression of the file, the image loses the information least likely to be noticed by the eye. The disadvantage of this format is precisely what makes it so attractive: the lossy compression. Once an image is saved, the discarded information is lost. The implication of this is that the entire image, or certain parts of it, cannot be enlarged. Additionally, the more work done to the image, requiring it to be re-saved, the more information is lost. This is why JPEGs are not recommended for archiving: there is no way to retain all of the information scanned from the source. Nevertheless, in terms of viewing capabilities and storage size, JPEGs are the best method for online viewing.

GIF (Graphics Interchange Format) files are an older format limited to 256 colours. Like TIFFs, GIFs use a lossless compression format without requiring as much storage space. While they don't have the compression capabilities of a JPEG, they are strong candidates for graphic art and line drawings. They also have the capability to be made into transparent GIFs, meaning that the background of the image can be rendered invisible, thereby allowing it to blend in with the background of the web page. This is frequently used in web design but can have a beneficial use in text creation. There are instances, as mentioned in Chapter 2, where it is possible that a text character cannot be encoded so that it can be read by a web browser. These could be inline images (a head-piece, for example) or characters not defined by ISOLAT1 or ISOLAT2. When the UVA Electronic Text Center created an online version of the journal Studies in Bibliography, there were instances of inline special characters that simply could not be rendered through the available encoding. As the journal is a searchable full-text database, providing a readable page image was not an option. Their solution to this, one that did not disrupt the flow of the digitized text, was to create a transparent GIF of the image. These GIFs were made so that they matched the size of the surrounding text and were subsequently inserted quite successfully into the digitized document.
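To sketch how such an inline image might be delivered (the filename and surrounding text here are invented for illustration, and are not taken from the Studies in Bibliography files), the HTML sent to the browser could contain something along these lines:

<P>... the ornament <IMG SRC="ornament042.gif" ALT="[inline ornament]"> closes the gathering ...</P>

Because the GIF's background is transparent and its dimensions are matched to the surrounding type, it reads as part of the running line of text rather than as a separate illustration.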
Referring back to the discussion on image types, the issue of file size tends to be one that comes up quite often in digitization. It is the lucky project or archive that has an unlimited amount of storage space, so most creators must contemplate how to achieve quality images that don't take up the 55MB of space needed by a 400 dpi, archival quality TIFF. However, it is easy to be led astray by the idea that the lower the bit-depth, the better the compression. Not so! Once again, the Electronic Text Center has produced a figure that illustrates how working with 24-bit images, rather than 8-bit, will produce a smaller JPEG along with a higher quality image file.

300 dpi 24-bit colour image, 2.65 x 3.14 inches: uncompressed TIFF 2188K; 'moderate loss' JPEG 59K
300 dpi 8-bit colour image, 2.65 x 3.14 inches: uncompressed TIFF 729K; 'moderate loss' JPEG 76K
100 dpi 24-bit colour image, 2.65 x 3.14 inches: uncompressed TIFF 249K; 'moderate loss' JPEG 9K
100 dpi 8-bit colour image, 2.65 x 3.14 inches: uncompressed TIFF 85K; 'moderate loss' JPEG 12K

(http://etext.lib.virginia.edu/helpsheets/scanimage.html)

While the sizes might not appear to be that markedly different, remember that these results were calculated with an image that measures approximately 3 x 3 inches. Turn these images into page size, calculate the number that can go into a project, and the storage space suddenly becomes much more of an issue. Therefore, not only does 24-bit scanning provide a better image quality, but the compressed JPEG will take less of the coveted project space.

So now that the three image formats have been covered, what should you use for your project? In the best possible situation you will use a combination of all three. TIFFs would not be used for online delivery, but if you want your images to have any future use, either for archiving, later enlarging, manipulation, or printing, or simply as a master copy, then there is no other format in which to store the images. In terms of online presentation, JPEGs and GIFs are the best method. JPEGs will be of a better calibre and smaller filesize but cannot be enlarged or they will pixelate. Yet in terms of viewing quality their condition will almost match the TIFF. How you use GIFs will depend on what types of images are associated with the project. However, if you are making thumbnail images that link to a separate page which exhibits the JPEG version, then GIFs are a popular choice for that task.

In terms of archival digital image creation there seems to be some debate. As the Electronic Text Center has pointed out, there is a growing dichotomy between preservation imaging and archival imaging. Preservation imaging is defined as 'high-speed, 1-bit (simple black and white) page images shot at 600 dpi and stored as Group 4 fax-compressed files' (http://etext.lib.virginia.edu/helpsheets/specscan.html). The results of this are akin to microfilm imaging. While this does preserve the text for reading purposes, it ignores the source as a physical object. Archiving often presupposes that the objects are being digitized so that the source can be protected from constant handling, or as an international means of accessibility. However, this type of preservation annihilates any chance of presenting the object as an artefact. Archiving an object has an entirely different set of requirements.
Yet, having said this, there is also a prevalent school of thought in the archiving community that the only imaging that can be considered of archival value is film imaging, which is thought to last at least ten times as long as a digital image. Nonetheless, the idea of archival imaging is still discussed amongst projects and funding bodies and cannot be overlooked. There is no set standard for archiving, and you might find that different places and projects recommend another model. However, the following type, format and resolution are recommended:

24-bit: There really is little reason to scan an archival image at anything less. Whether the source is colour or greyscale, the images are more realistic and have a higher quality at this level. As the above example shows, the filesize of the subsequently compressed image does not benefit from scanning at a lower bit-size.

600 dpi: This is, once again, a problematic recommendation. Many projects assert that scanning in at 300 or 400 dpi provides sufficient quality to be considered archival. However, many of the top international digitization centres (Cornell, Oxford, Virginia) recommend 600 dpi as an archival standard: it provides excellent detail of the image and allows for quite large JPEG images to be produced. The only restrictive aspect is the filesize, but when thinking in terms of archival images you need to try and get as much storage space as possible. Remember, the master copies do not have to be held online, as offline storage on writeable CD-ROMs is another option.

TIFF: This should come as no surprise given the format discussion above. TIFF files, with their complete retention of scanned information and cross-platform capabilities, are really the only choice for archival imaging. The images maintain all of the information scanned from the source and are the closest digital replication available. The size of the file, especially when scanned at 24-bit, 600 dpi, will be quite large, but well worth the storage space. You won't be placing the TIFF image online, but it is simple to make a JPEG image from the TIFF as a viewing copy.

This information is provided with the caveat that scanning technology is constantly changing for the better. It is more than likely that in the future these standards will become passé, with higher standards taking their place.

3.4.2: OCR issues

The goal of recognition technology is to re-create the text and, if desired, other elements of the page, including such things as tables and layout. Refer back to the concept of the scanner and how it takes a copy of the image by replicating it with patterns of bits, the dots that are either filled or unfilled. OCR technology examines the patterns of dots and turns them into characters. Depending upon the type of scanning software you are using, the resulting text can be piped into many different word processing or spreadsheet programs. Caere OmniPage released version 10.0 in the Autumn of 1999, which boasts the new Predictive Optical Word Recognition Plus+ (POWR++) technology. As the OmniPage factsheet explains, POWR++ enables OmniPage Pro to recognize standard typefaces, without training, from 4 to 72 point sizes. POWR++ recognizes 13 languages (Brazilian Portuguese, British English, Danish, Dutch, Finnish, French, German, Italian, Norwegian, Portuguese, Spanish, Swedish, and U.S. English) and includes full dictionaries for each of these languages.
In addition, POWR++ identifies and recognizes multiple languages on the same page (http://www.caere.com/products/omnipage/pro/factsheet.asp). However, OCR software programs (including OmniPage) are very up-front about the fact that their technology is optimised for laser printer quality text. The reasoning behind this should be readily apparent. As scanning software attempts to examine every pixel in the object and then convert it into a filled or empty space, a laser quality printout will be easy to read as it has very clear, distinct characters on a crisp white background, a background that will not interfere with the clarity of the letters. However, once books become the object type, the software capabilities begin to degrade.

This is why the first thing you must consider if you decide to use OCR for the text source is the condition of the document to be scanned. If the characters in the text are not fully formed or there are instances of broken type or damaged plates, the software will have a difficult time reading the material. The implication of this is that late 19th- and 20th-century texts have a much better chance of being read well by the scanning software. As you move further away from the present, with the differences in printing, the OCR becomes much less dependable. The changes in paper, moving from a bleached white to a yellowed, sometimes foxed, background, create noise that the software must sift through. Then the font differences wreak havoc on the recognition capabilities. The gothic and exotic type found in the hand-press period contrasts markedly with the computer-set texts of the late 20th century. It is critical that you anticipate type problems when dealing with texts that have such forms as long esses, sloping descenders, and ligatures. Taking sample scans with the source materials will help pinpoint some of these digitizing issues early on in the project.

While the advantages of exporting text in different word processing formats are quite useful if you are scanning in a document to print or to compensate for an accidentally deleted file, there are a few issues that should take priority with the text creator. Assuming you are using a software program such as OmniPage, you should aim for a scan that retains some formatting but not a complete page element replication. As will be explained in greater detail in Chapter 4, when text is saved with formatting that relates to a specific program (Word, WordPerfect, even RTF) it is infused with a level of hidden markup, a markup that explains to the software program what the layout of the page should look like. In terms of text creation, and the long-term preservation of the digital object, you want to be able to control this markup. If possible, scanning at a setting that will retain font and paragraph format is the best option. This will allow you to see the basic format of the text; I'll explain the reason for this in a moment. If you don't scan with this setting and opt for the choice that eliminates all formatting, the result will be text that includes nothing more than word spacing: there will be no accurate line breaks, no paragraph breaks, no page breaks, no font differentiation, etc. Scanning at a mid-level of formatting will assist you if you have decided to use your own encoding. As you proofread the text you will be able to add the structural markup chosen for the project. Once this has been completed the text can be saved out in a text-only format.
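As a purely hypothetical sketch of this approach (the tag names below are invented for illustration and are not prescribed by any particular package or standard), a proofread chapter opening might end up looking something like:

<chapter n="3">
<heading>Of the Nature of Flattery</heading>
<para>It is an observation frequently made ...</para>
<pagebreak n="42">
</chapter>

The markup records only the structure the project cares about, and the file itself remains plain text, independent of the word processor or OCR package that produced it.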
Therefore, not only will you have the digitized text saved in a way that will eliminate program-added markup, but you will also have a basic level of user-dictated encoding.

3.5: Re-keying

Unfortunately for the text creator, there are still many situations where the documents or project preclude the use of OCR. If the text is of a poor or degraded quality, then it is quite possible that the time spent correcting the OCR mistakes will exceed that of simply typing in the text from scratch. The amount of information to be digitized also becomes an issue. Even if the document is of a relatively good quality, there might not be enough time to sit down with 560 volumes of texts (as with the Early American Fiction project) and process them through OCR. The general rule of thumb, and this varies from study to study, is that a best-case scenario would be three pages scanned per minute; this doesn't take into consideration the process of putting the document on the scanner, flipping pages, or the subsequent proofreading. If, when addressing these concerns, OCR is found incapable of handling the project digitization, the viable solution is re-keying the text. Once you've made this decision, the next question to address is whether to handle the document in-house or out-source the work.

Deciding to digitize the material in-house relies on having all the necessary hardware, software, staff, and time. There are a few issues that come into play with in-house digitization. The primary concern is the speed of re-keying. Most often the re-keying is done by the research assistants working on the project, or graduate students from the text creator's local department. The problem here is that paying an hourly rate to someone re-keying the text often proves more expensive than out-sourcing the material. Also, there is the concern that a single person typing in material tends to overlook keyboarding errors, and if the staff member is familiar with the source material, there is a tendency to correct automatically those things that seem incorrect. So while in-house digitization is an option, these concerns should be addressed from the outset.

The most popular choice with many digitization projects (Studies in Bibliography, The Early American Fiction Project, Historical Collections for the National Digital Library and the Chadwyck-Healey databases, to name just a few) is to out-source the material to a professional keyboarding company. The fundamental benefit most often cited is the almost 100% accuracy rate of the companies. One such company, Apex Data Services, Inc. (used by the University of Virginia Electronic Text Center), promises a conversion accuracy of 99.995%, along with 100% structural accuracy, and reliable delivery schedules. Their ADEPT software allows the dual keyboarders to see a real-time comparison, allowing for a single-entry verification cycle (http://www.apexinc.com/dcs/dcs_index.html). Also, by employing keyboarders who do not possess a subject speciality in the text being digitized (many, for that matter, do not speak the language being converted) they avoid the problem of keyboarders subconsciously modifying the text. Keyboarding companies are also able to introduce a base-level encoding scheme, established by the project creator, into the documents, thereby eliminating some of the more rudimentary tagging tasks. Again, as with most steps in the text creation process, the answers to these questions will be project dependent.
The decisions made for a project that plans to digitize a collection of works will be markedly different from those made by an academic who is creating an electronic edition. It reflects back, as always, to the importance of the document analysis stage. You must recognise what the requirements of the project will be, and what external influences (especially staff size, equipment availability, and project funding) will affect the decision-making process.

Chapter 4: Markup: The key to reusability

4.1: What is markup?

Markup is most commonly defined as a form of text added to a document to transmit information about both the physical and electronic source. Do not be surprised if the term sounds familiar; it has been in use for centuries. It was first used within the printing trade as a reference to the instructions inscribed onto copy so that the compositor would know how to prepare the typographical design of the document. As Philip Gaskell points out, 'Many examples of printers' copy have survived from the hand-press period, some of them annotated with instructions concerning layout, italicization, capitalization, etc.' (Gaskell 1995, 41). This concept has evolved slightly through the years but has remained entwined with the printing industry. G.T. Tanselle writes in a 1981 article on scholarly editing, 'one might...choose a particular text to mark up to reflect these editorial decisions, but that text would only be serving as a convenient basis for producing printer's copy...' (Tanselle 1981, 64). There still seems to be some demarcation between the usage of the term for bibliography and for computing, but the boundary is really quite blurred. The leap from markup as a method of labelling instructions on printer's copy to markup as a language used to describe information in an electronic document is not so vast.

Therefore when we think of markup there are really three differing types (two of which will be discussed below). The first is the markup that relates strictly to formatting instructions found on the physical text, as mentioned above. It is used for the creation of an emended version of the document and, with the exception of the work of textual scholars, is rarely referred to again.

Then there is the proprietary markup found in electronic document encoding, which is tied to a specific piece of software or developer. This markup is concerned primarily with document formatting, describing what words should be in italics or centred, where the margins should be set, or where to place a bulleted list. There are a few things to note about this type of markup. The first is that being proprietary means that it is intimately tied to the software that created it. This does not pose a problem as long as the document will only remain within that software program, and as long as the creator recognises that in the future there is no guarantee that the software will exist. This is important, as proprietary software formats allow users to say where and how they want the document formatted, but then the software inserts its own markup language to accomplish this. When users create documents in Word or a PDF file, they are unconsciously adding encoding with every keystroke. As anyone who has created a document in one software format and attempted to transfer it to another is aware, the encoding does not transfer, and if for some reason a bit of it does, it rarely means the same thing.

The third type of markup is non-proprietary, a generalised markup language.
There are two critical distinctions between this markup and the previous two. Firstly, as it is a general language and not tied to specific software or hardware, it offers cross-platform capabilities. This ensures that documents utilising this style of encoding will be readable many years down the line. Secondly, while a generalised markup language, as with the others, allows users to insert formatting markup in the document, it also allows for encoding based upon the content of the work. This is a level of control not found in the previous styles of markup. Here the user is able to describe not only the appearance of the document but the meanings found within it. This is a critical aspect of electronic text creation, and therefore receives more in-depth treatment below.

4.2: Visual/presentational markup vs. structural/descriptive markup

The discussion of visual/presentational markup vs. structural/descriptive markup carries on from the concepts of proprietary and non-proprietary markup. As the name implies, presentational markup is concerned with the visual structure of a text. Depending upon what processing software is being used, the markup explains to the computer how the document should appear. So if the work should be seen in 12-point Tahoma, the software dictates markup so that this happens. Presentational markup is concerned with structure only insofar as it relates to the visual aspect of the document. It does not care whether a heading is for a book, a chapter or a paragraph; the only consideration is how that heading should look on the page. Most proprietary language formats tend to focus solely on presentational issues. To move into descriptive markup would require that the software provide the document creator with the ability to formulate their own tags with which to encode the structure and presentation of the work.

In other words, descriptive markup relates less to the visual strategy of the work and more to the reasons behind the structure. It allows the creator to encode the document with a markup that more clearly shows how the presentation, configuration, and content relate to the document as a whole. Once again, the beneficial effects of thorough document analysis can be seen. Having a holistic sense of the document, and a detailed listing of its critical elements, will demonstrate how descriptive markup advances a project. In this case, a non-proprietary language will be the most beneficial, as it will allow the document creators to arrive at their own tagsets, providing a much needed level of control over the encoding development.

4.2.1: PostScript and Portable Document Format (PDF)

In 1985, Adobe Systems created a programming language for printers called PostScript. In so doing, they produced a system that allowed computers to 'talk' to their printers. This language describes for the printer the appearance of the page, incorporating elements like text, graphics, colour, and images, so that documents maintain their integrity through the transmission from computer to printer. PostScript printers have become industry standard with corporations, marketers, publishing companies, graphic designers, and more. Printers, slide recorders, imagesetters: all these output devices utilise PostScript technology. Combine this with PostScript's multiple operating system capability and it becomes clear why PostScript has become the standard for printing technology (http://www.adobe.com/print/features/psvspdf/main.html).
The PostScript language can be found in most printers (Epson, IBM, and Hewlett-Packard to name just a few), almost guaranteeing that a high standard of printing can be found in both the home and office. Adobe provides a list of compatible products at http://www.adobe.com/print/postscript/oemlist.html.

Portable Document Format (PDF) was created by Adobe in 1993 to complement their PostScript language. PDF allows the user to view a document with a presentational integrity that almost resembles a scanned image of the source. This delivery of visually rich content is the most attractive use of PDF. The format is entirely concerned with keeping the document intact, and, to ensure this, allows any combination of text, graphics and images. It also has full, rich colour presentation and is therefore often used with corporate and marketing graphic arts materials. Another enticing feature, depending on the quality of the printer, is that when a PDF file is printed out, the hard copy output is an exact replication of the screen image. PDF is also desirable for its delivery strengths. Not only does the document maintain its visual integrity, but it can also be compressed. This compression eases online and CD-ROM transmission and assists its archiving opportunities. PDF files can be read through an Acrobat Reader application that is freely available for download via the web. This application is also capable of serving as a browser plug-in for online document viewing.

Creating PDF files is a bit more complicated than the viewing procedure. To write a PDF document it is necessary to purchase Adobe software. PDFWriter allows the user to create the PDF document, and the more expensive Adobe Capture program will convert TIFF files into PDF formatted text versions. If the user would like the document to become more interactive, offering the ability to annotate the document for example, then this functionality can be added with the additional purchase of Acrobat Exchange, which serves an editorial function. Exchange allows the user to annotate and edit the document, search across documents, and also has plug-ins that provide highlighting ability.

Taking into consideration the earlier discussion of visual vs. structural markup, it is clear how programs like PostScript and PDF fall into the category of a proprietary processing language concerned with presentational rather than descriptive markup. This does not imply that these languages should be avoided. On the contrary, if the only concern is how the document appears both on the screen and through the printer, then software of this nature is appropriate. However, if the document needs to cross platforms or the project objectives require control over the encoding or document preservation, then these proprietary programs are not dependable.

4.2.2: HTML 4.0

HyperText Markup Language (or HTML as it is commonly known) is a non-proprietary markup system used for publishing hypertext on the World Wide Web. To date, it has appeared in four main versions (1.0, 2.0, 3.2, 4.0), with the World Wide Web Consortium (W3C) recommending 4.0 as the markup language of choice. HTML is a derivative of SGML, the Standard Generalised Markup Language. SGML will be discussed in greater detail in Chapter 5, but suffice it to say that it is an international standard metalanguage that defines a set of rules for device-independent, system-independent methods of encoding electronic texts.
SGML allows you to create your own markup language but provides the rules necessary to ensure its processing and preservation. HTML is a successful implementation of the SGML concepts, and, as a result, is accessible to most browsers and platforms. Along with this, it is a relatively simple markup language to learn, as it has a limited tagset. HTML is by far the most popular web-publishing language, allowing users to create online text documents that include multimedia elements (such as images, sounds, and video clips), and then put these documents in an environment that allows for instant publication and retrieval.

There are many advantages to a markup language like HTML. As mentioned above, the primary benefit is that a document encoded with HTML can be viewed in almost any browser, an extremely attractive option for a creator who wants documents which can be viewed by an audience with varied systems. However, it is important to note that while the encoding can cross platforms, there are consistently differences in page appearance between browsers. While W3C recommends the usage of HTML 4.0, many of its features are simply not available to users with early versions of browsers. Unlike PDF, which is extremely concerned with keeping the document and its format intact, HTML has no true sense of page structure and files can neither be saved nor printed with any sense of precision.

Besides the benefit of a markup language that crosses platforms with ease, HTML attracts its many users for the ease with which it can be mastered. For users who do not want to take the time to learn the tagset, the good news is that conversion-to-HTML tools are becoming more accessible and easier to use. Those who cannot even spare the time to learn how to use HTML-creation software (of which there is a limited quantity) can sit down with any text creation program (Notepad, for example) and author an HTML document. Then by using the 'Open File' tool in a browser, the document can immediately be viewed. What this means for novice HTML authors is that they can sit down with a text creator and a browser and teach themselves a markup language in one session. And as David Seaman, Director of the Electronic Text Center at the University of Virginia, points out:

[this] has a real pedagogical value as a form of SGML that makes clear to newcomers the concept of standardized markup. To the novice, the mass of information that constitutes the Text Encoding Initiative Guidelines, the premier tagging scheme for most humanities documents, is not easily grasped. In contrast, the concise guidelines to HTML that are available on-line (and usually as a "help" option from the menu of a Web client) are a good introduction to some of the basic SGML concepts. (Seaman 1994)

This is of real value to the user. The notion of marking up a text is quite often an overwhelming concept. Most people do not realise that markup enters into their life every time they make a keystroke in a word processing program. So for the uninitiated, HTML provides a manageable stepping-stone into the world of more complex encoding. Once this limited tagset is mastered, many users find the jump into an extended markup language less intimidating and more liberating.

However, one of the drawbacks to this easy authoring language is that many of the online documents are created without a DTD. A DTD is the abbreviation for a document type definition, which outlines the formal specifications for an SGML-encoded document.
Basically, a DTD is the method for spelling out the SGML rules that the document is following. It sets the standards for what markup can be used and how this markup interacts with other markup. So, for example, if you create an HTML document with a specific software program, say HoTMetaL PRO, the resulting text will begin with a document type declaration stating which DTD is being used. A sample declaration from a HoTMetaL creation looks like this:

<!DOCTYPE HTML PUBLIC "-//SoftQuad//DTD HoTMetaL PRO 4.0::19970714::extensions to HTML 4.0//EN" "hmpro4.dtd">

As can be seen in the above statement, the declaration explains that the document will follow the HoTMetaL PRO 4.0 DTD. In so doing, the markup language used must adhere to the rules set out in this specific DTD. If it does not, then the document cannot be successfully validated and will not work.

As it stands now, web browsers require neither a DTD nor a document type declaration. Browsers are notoriously lax in their HTML requirements, and unless something serious is missing from the encoded document it will be successfully viewed through a Web client. The impact of this is that while HTML provides a convenient and universal markup language for a user, many of the documents floating out in cyberspace are permeated with invalid code. The focus then moves away from authoring documents that conform to a set of encoding guidelines and towards the creation of works that can be viewed in a browser (Seaman 1994). This problem will become more severe with the increased use of Extensible Markup Language, or XML as it is more commonly known. This markup language, which is being lauded as the new lingua franca, combines the visual benefits of HTML with the contextual benefits of SGML/TEI. However, while XML will have the universality of HTML, the web clients will require a more stringent adherence to markup rules. While documents that comply with the rules of an HTML DTD will find the transition relatively simple, the documents that were constructed strictly with viewing in mind will require a good deal of clean-up prior to conversion.

This is not to say that HTML is not a useful tool for creating online documents. As in the case of PostScript and PDF, the choice to use HTML should be document dependent. It is the perfect choice for static documents that will have a short shelf-life. If you are creating course pages or supplementary materials regarding specific readings that will not be necessary or available after the end of term, then HTML is an appropriate choice. If, however, you are concerned about presentational and structural integrity, the markup of document content and/or the long-term preservation of the text, then a user-definable markup language is a much better choice.

4.2.3: User-definable descriptive markup

A user-definable descriptive markup is exactly what its name implies. The content of the markup tags is established solely by the user, not by the software. As a result of SGML and its concept of a DTD, a document can have any kind of markup a creator desires. This frees the document from being married to proprietary hardware or software and from its reliance upon an appearance-based markup language. If you decide to encode the document with a non-proprietary language, which we highly recommend, then this is a good time to evaluate the project goals.
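As a purely hypothetical sketch of what this freedom looks like in practice (the tag names below are invented for illustration and are not drawn from any published scheme), a project transcribing correspondence might decide on markup such as:

<letter date="1854-03-02">
<salutation>My dear Sir,</salutation>
<para>I have this morning received the proofs ...</para>
<signature>J. Smith</signature>
</letter>

None of these tags says anything about typefaces or layout; each records what a piece of the text is, leaving decisions about its eventual appearance to whatever software later processes the document.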
While a user-definable markup language gives you control over the content of the markup, and thereby more control over the document, the markup can only be fully understood by you. Although not tied to a proprietary system, it is also not tied to any accepted standard. A markup language defined and implemented by you is simply that: a personal, non-proprietary markup system. However, if the electronic texts require a language that is non-proprietary, more extensive and content-oriented than HTML, and comprehensible and acceptable to a humanities audience, then there is a solution: the Text Encoding Initiative (TEI). TEI is an international implementation of SGML, providing a non-proprietary markup language that has become the de facto standard in Humanities Computing. TEI, which is explained more fully in Chapter 5, provides 'a full set of tags, a methodology, and a set of Document Type Descriptions (DTDs) that allow the detailed (or not so detailed) description of the spatial, intellectual, structural, and typographic form of a work' (Seaman 1994).

4.3: Implications for long-term preservation and reuse

Markup is a critical, and inescapable, part of text creation and processing. Regardless of the method chosen to encode the document, some form of markup will be included in the text. Whether this markup is proprietary or non-proprietary, appearance-based or content-based, is up to you. Be sure to evaluate the project goals when making the encoding decisions. If the project is short-lived or necessarily software dependent, then the choices are relatively straightforward. However, if you are at all concerned about long-term preservation, cross-platform capabilities, and/or descriptive markup, then a user-definable (preferably TEI) markup language is the best choice. As Peter Shillingsburg corroborates:

...the editor with a universal encoding system developing an electronic edition with a multiplatform application has created a tool available to anyone with a computer and has ensured the longevity of the editorial work through generations to come of software and hardware. It seems worth the effort. (Shillingsburg 1996, 163)

Chapter 5: SGML/XML and TEI

The previous chapter showed what markup is, and how it plays a crucial role in almost every aspect of information processing. Now we shall learn about some crucial applications of descriptive markup which are ideally suited to the types of texts studied by those working in the arts and humanities disciplines.

5.1: The Standard Generalized Markup Language (SGML)

The late 1970s and early 1980s saw a consensus emerging that descriptive markup languages had numerous advantages over other types of text encoding. A number of products and macro languages appeared which were built around their own descriptive markup languages, and whilst these represented a step forward, they were also constrained by the fact that users were required to learn a new markup language each time, and could only describe those textual features which the markup scheme allowed (sometimes extensions were possible, but implementing them was rarely a straightforward process). The International Standards Organisation (ISO) also recognised the value of descriptive markup schemes, and in 1986 an ISO committee released a new standard called ISO 8879, the Standard Generalized Markup Language (SGML).
This complex document represented several years' effort by an international committee of experts, working together under the chairmanship of Dr Charles Goldfarb (one of the creators of IBM's descriptive markup language, GML). Since SGML was a product of the International Standards process, the committee also had the benefit of input from experts from the numerous national standards bodies associated with the ISO, such as the UK's British Standards Institute (BSI).

5.1.1: SGML as metalanguage

A great deal of largely unjustified mystique surrounds SGML. You do not have to look very hard to find instances of SGML being described as 'difficult to learn', 'complex to implement', or 'expensive to use', when in fact it is none of these things. People all too frequently confuse the acronym, SGML, with SGML applications, many of which are indeed highly sophisticated and complex operations, designed to meet the rigorous demands of blue-chip companies working in major international industries (automotive, pharmaceutical, or aerospace engineering). It should not be particularly surprising that a documentation system designed to control and support every aspect of the tens of thousands of pages of documentation needed to build and maintain a battleship, fix the latest passenger aircraft, or supplement a legal application for international recognition for a new advanced drug treatment, should appear overwhelmingly complex to an outsider.

In fact, despite its name, SGML is not even a markup language. Instead, it would be more appropriate to call SGML a 'metalanguage'. In a conventional markup language, such as HTML, users are offered a pre-defined set of markup tags from which they must make appropriate selections; if they suddenly introduce new tags which are not part of the HTML specification, then it is clear that the resulting document will not be considered valid HTML, and it may be rejected or incorrectly processed by HTML software (e.g. an HTML-compatible browser). SGML, on the other hand, does not offer a pre-defined set of markup tags. Rather, it offers a grammar and specific vocabulary which can be used to define other markup languages (hence 'metalanguage'). SGML is not constrained to any one particular type of application, and it is neither more nor less suited to producing technical documentation and specifications in the semiconductor industry than it is to marking up linguistic features of ancient inscribed tablets of stone. In fact, SGML can be used to create a markup language to do pretty well anything, and that is both its greatest strength and its greatest weakness.

SGML cannot be used 'out-of-the-box', so to speak, and because of this it has earned an undeserved reputation in some quarters as being troublesome and slow to implement. On the other hand, there are many SGML applications (and later we shall learn about one in particular) which can be used straightaway, as they offer a fully documented markup language which can be recognised by any one of a suite of tools and implemented with a minimum of fuss. SGML provides a mechanism for like-minded people with a shared concern to get together and define a common markup language which satisfies their needs and desires, rather than being limited by the vision of the designers of a closed, possibly proprietary markup scheme which only does half the job.
SGML offers another advantage in that it not only allows (groups of) users to define their own markup languages, it also provides a mechanism for ensuring that the rules of any particular markup language can be rigorously enforced by SGML-aware software. For example, within HTML, although there are six different levels of heading defined (e.g. the tags <H1> to <H6>), there is no requirement that they should be applied in a strictly hierarchical fashion; in other words, it is perfectly possible for a series of headings in an HTML document to be marked up as <H1>, then <H3>, followed by <H5>, followed in turn by <H2>, <H4>, and <H6>, all to achieve a particular visual appearance in a particular HTML browser. By contrast, should such a feature be deemed important, an SGML-based markup language could be written in such a way that suitable software can ensure that levels of heading nest in a strictly hierarchical fashion (and the strength of this approach can perhaps become even more evident when encoding other kinds of hierarchical structure, e.g. a <BOOK> must contain one or more <CHAPTER>s, each of which must in turn contain one or more <PARAGRAPH>s, and so on). We shall learn more about this in the following section.

There is one final, crucial, difference between SGML-based markup languages and other descriptive markup languages: the process by which International Standards are created, maintained, and updated. ISO Standards are subject to periodic formal review, and each time this work is undertaken it happens in full consultation with the various national standards bodies. The Committee which produced SGML has guaranteed that if and when any changes are introduced to the SGML standard, this will be done in such a way as to ensure backwards compatibility. This is not a decision which has been undertaken lightly, and the full implications can be inferred from the fact that commercial enterprises rarely make such an explicit commitment (and even when they do, users ought to reflect upon the likelihood that such a commitment will actually be fulfilled given the considerable pressures of a highly competitive marketplace). The essential difference has been characterised thus: the creators of SGML believe that a user's data should belong to that user, and not be tied up inextricably in a proprietary markup system over which that user has no control; whereas the creators of a proprietary markup scheme can reasonably be expected to have little motivation to ensure that data encoded using their scheme can be easily migrated to, or processed by, a competitor's software products.

5.1.2: The SGML Document

The SGML standard gives a very rigid definition as to what constitutes an SGML document. Whilst there is no need for us to consider this definition in detail at this stage, it is worthwhile reviewing the major concepts as they offer a valuable insight into some crucial aspects of an electronic text. Perhaps first and foremost amongst these is the notion that an SGML document is a single logical entity, even though in practice that document may be composed of any number of physical data files, spread over a storage medium (e.g. a single computer's hard disk) or even over different types of storage media connected together via a network. As today's electronic publications become more and more complex, mixing (multilingual) text with images, audio, and image data, it reinforces the need to ensure that they are created in line with accepted standards.
For example, an article from an electronic journal mounted on a website may be delivered to the end-user in the form of a single HTML document, but that article (and indeed the whole journal) may rely upon dozens or hundreds of data files, a database to manage the entire collection of files, several bespoke scripts to handle the interfacing between the web and the database, and so on. Therefore, whenever we talk about an electronic document, it is vitally important to remember that this single logical entity may, in fact, consist of many separate data files.

SGML operates on the basis of there being three major parts which combine to form a single SGML document. Firstly, there is the SGML declaration, which specifies any system and software constraints. Secondly, there is the prolog, which defines the document structure. Lastly, there is the document instance, which contains what one would ordinarily think of as the document. Whilst this may perhaps appear unnecessarily complicated, in fact it provides an extremely valuable insight into the key components which are essential to the creation of an electronic document.

The SGML declaration tells any software that is going to process an SGML document all that it should need to know. For example, the SGML declaration specifies which character sets have been used in the document (normally ASCII or ISO 646, but more recently this could be Unicode, or ISO 10646). It also establishes any constraints on system variables (e.g. the length of markup tag names, or the depth to which tags can be nested), and states whether or not any of SGML's optional features have been used. The SGML standard offers a default set-up, so that, for example, the characters < and > are used to delimit markup tag names (and with the widespread acceptance of HTML, this has become the accepted way to indicate markup tags), but if for any reason this presented a problem for a particular application (e.g. encoding a lot of data in which < and > were heavily used to indicate something else), it would be possible to redefine the delimiters as @ or #, or whatever characters were deemed to be more appropriate.

The SGML declaration is important for a number of reasons. Although it may seem an unduly complicated approach, it is often these fundamental system or application dependencies which make it so difficult to move data around between different software and hardware environments. If the developers of wordprocessing packages had started off by agreeing on a single set of internal markup codes they would all use to indicate a change in font, the centring of a line of text, the occurrence of a pagebreak, etc., then users' lives would have been made a great deal easier; however, this did not happen, and hence we are left in a situation where data created in one application cannot easily be read by another. We should also remember that as our reliance upon information technology grows, time passes, and applications and companies appear or go bust, there may be data which we wish to exchange or reuse which were created when the world of computing was a very different place. It is a very telling lesson that although we are still able to access data inscribed on stone tablets or committed to papyri or parchment hundreds (if not thousands) of years ago, we already have masses of computer-based data which are effectively lost to us because of technological progress, the demise of particular markup schemes, and so on.
Furthermore, by supplying a default environment, the average end-user of an SGML-based encoding system is unlikely to have to familiarise him- or herself with the intricacies of the SGML declaration. Indeed it should be enough simply to be aware of the existence of the SGML declaration, and how it might affect one's ability to create, access, or exploit a particular source of data.

The next major part of an SGML document is the prolog, which must conform to the specification set out in the formal SGML standard, and the syntax given in the SGML declaration. Although it is hard to discuss the prolog without getting bogged down in the details of SGML, suffice it to say that it contains (at least one) document type declaration, which in turn contains (or references) a Document Type Definition (or DTD). The DTD is one of the single most important features of SGML, and what sets it apart from (not to say above) other descriptive markup schemes. Although we shall learn a little more about the process in the following section, the DTD contains a series of declarations which define the particular markup language which will be used in the document instance, and also specifies how the different parts of that language can interrelate (e.g. which markup tags are required and which optional, the contexts in which they can be used, and so on). Often, when people talk about 'using SGML', they are actually talking about using a particular DTD, which is why some of the negative comments that have been made about SGML (e.g. 'It's too difficult', or 'It doesn't allow me to encode those features which I consider to be important') are erroneous, because such complaints should properly be directed at the DTD (and thus aimed at the DTD designer) rather than at SGML in general.

Other than some of the system constraints imposed by the SGML declaration, there are no strictures imposed by the SGML standard regarding how simple or complex the markup language defined in the DTD should be. Whilst the syntax used to write a DTD is fairly straightforward, and most people find that they can start to read and write DTDs with surprising ease, to create a good DTD requires experience and familiarity with the needs and concerns of both data creators and end-users. A good DTD nearly always reflects a designer's understanding of all these aspects, an appreciation of the constraints imposed by the SGML standard, and a thorough process of document analysis (see Chapter 2) and DTD-testing. In many ways this situation is indicative of the fact that the creators of the SGML standard did not envisage that individual users would be very likely to produce their own DTDs for highly specific purposes. Rather, they thought (or perhaps hoped) that groups would form within industry sectors or large-scale enterprises to produce DTDs that were tailored to the needs of their particular application. Indeed, the areas in which the uptake of SGML has been most enthusiastic have been operating under exactly those sorts of conditions: for example, the international Air Transport Authority seeking to standardise aircraft maintenance documentation, or the pharmaceutical industry's attempts to streamline the documentary evidence needed to support applications to the US Food and Drug Administration.
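To make the idea of a DTD a little more concrete, the kind of hierarchy mentioned earlier (a <BOOK> containing <CHAPTER>s, which in turn contain <PARAGRAPH>s) could be expressed in a prolog along the following lines. This is an illustrative sketch only, written in the simplified declaration syntax also used by XML, and not a real, published DTD:

<!DOCTYPE book [
<!ELEMENT book      (chapter+)>
<!ELEMENT chapter   (heading, paragraph+)>
<!ELEMENT heading   (#PCDATA)>
<!ELEMENT paragraph (#PCDATA)>
]>

The declarations state which elements may occur, and in what order and combination: a book must consist of one or more chapters, and each chapter must open with a heading followed by one or more paragraphs.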
As we shall see, the DTD of prime importance to those working within the Arts and Humanities disciplines has already been written and documented by the members of the Text Encoding Initiative, and in that case the designers had the foresight to build in mechanisms to allow users to adapt or extend the DTD to suit their specific purposes. However, as a general rule, if users wish to write their own DTDs, or tweak an SGML declaration, they are entirely free to do so (within the framework set out by the SGML standard), but the vast majority of SGML users prefer to rely upon an SGML declaration and DTD created by others, for all the benefits of interoperability and reusability promised by this approach.

This brings us to the third main part of an SGML document: namely, the document instance itself. This is the part of the document which contains a combination of raw data and markup, and its contents are constrained by both the SGML declaration and the contents of the prolog (especially the declarations in the DTD). Clearly, from the perspective of data creators and end-users, this is the most interesting part of an SGML document, and it is common practice for people to use the term 'SGML document' when they are actually referring to a document instance. Such confusion should be largely unproblematic, provided these users always remember that when they are interchanging data (i.e. a document instance) with colleagues, they should also pass on the relevant DTD and SGML declaration. In the next section we shall investigate the practical steps involved in the creation of an SGML document, and the very valuable role that can be played by SGML-aware software.

5.1.3: Creating Valid SGML Documents

How you create SGML documents will be greatly influenced by the aims of your project, the materials you are working with, and the resources available to you. For the purposes of this discussion, let us start by assuming that you have a collection of existing non-electronic materials which you wish to turn into some sort of electronic edition. If you have worked your way through the chapter on document analysis (Chapter 2), then you will know what features of the source material are important to you, and what you will want to be able to encode with your markup. Similarly, if you have considered the options discussed in the chapter on digitization (Chapter 3), you will have some idea of the type of electronic files with which you will be starting to work. Essentially, if you have chosen to OCR the material yourself, you will be using clear or plain ASCII text files, which will need to undergo some sort of editing or translation as part of the markup process. Alternatively, if the material has been re-keyed, then you will either have electronic text files which already contain some basic markup, or you will also have plain ASCII text files.

Having identified the features you wish to encode, you will need to find a DTD which meets your requirements. Rather than trying to write your own DTD from scratch, it is usually worthwhile investing some time to look around for existing public DTDs which you might be able to adopt, extend, or adapt to suit your particular purposes. There are many DTDs available in the public domain, or made freely available for others to use (e.g. see Robin Cover's The SGML/XML Web Page (http://www.oasis-open.org/cover/)), but even if none of these match your needs, some may be worth investigating to see how others have tackled common problems.
Although there are some tools available which are designed to facilitate the process of DTD-authoring, they are probably only worth buying if you intend to be doing a great deal of work with DTDs, and they can never compensate for poor document analysis. However, if you are working with literary or linguistic materials, you should take the time to familiarise yourself with the work of the Text Encoding Initiative (see 5.2: The Text Encoding Initiative and TEI Guidelines), and think very carefully before rejecting use of their DTD.

Before we go any further, let us consider two other scenarios: one where you already have the material in electronic form but you need to convert it to SGML; the other, where you will need to create SGML from scratch. Once again, there are many useful tools available to help convert from one markup scheme to another, but if your target format is SGML this may have some bearing on the likelihood of success (or failure) of any conversion process. As we have seen, SGML lends itself most naturally to a structured, hierarchical view of a document's content (although it is perfectly possible to represent very loose organisational structures, and even non-hierarchical document webs, using SGML markup), and this means that it is much simpler to convert from a proprietary markup scheme to SGML if that scheme also has a strong sense of structure (i.e. adopts a descriptive markup approach) and has been used sensibly. However, if a document has been encoded with a presentational markup scheme which has, for example, used codes to indicate that certain words should be rendered in an italic font (regardless of the fact that sometimes this has been for emphasis, at other times to indicate book and journal titles, and elsewhere to indicate non-English words), then this will dramatically reduce the chances of automatically converting the data from this presentation-oriented markup scheme into one which complies with an SGML DTD.

It is probably worth noting at this point that these conversion problems primarily apply when converting from a non-descriptive, non-SGML markup language into SGML; the opposite process, namely converting from SGML into another target markup scheme, is much more straightforward (because it would simply mean that data variously marked up with, say, <EMPHASIS>, <TITLE>, and <FOREIGN> tags had their markup converted into the target scheme's markup tags for <ITALIC>). It is also worth noting that such a conversion might not be a particularly good idea, because you would effectively be throwing information away. In practice it would be much more sensible to retain the descriptive/SGML version of your material, and convert to a presentational markup scheme only when absolutely required for the successful rendering of your data on screen or on paper. Indeed, many dedicated SGML applications support the use of stylesheets to offer some control over the on-screen rendition of SGML-encoded material, whilst preserving the SGML markup behind the scenes.

If you are creating SGML documents from scratch, or editing existing SGML documents (perhaps the products of a conversion process, or the results of a re-keying exercise), there are several factors to consider. It is essential that you have access to a validating SGML parser, which is a software program that can read an SGML declaration and a document's prolog, understand the declarations in the DTD, and ensure that the SGML markup used throughout the document instance conforms appropriately.
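To take a deliberately trivial, hypothetical case using the sketch declarations given earlier: a parser checking documents against a content model of (heading, paragraph+) for <chapter> would accept the first fragment below but reject the second, reporting that the required heading is missing:

<chapter><heading>Chapter One</heading><paragraph>Text ...</paragraph></chapter>

<chapter><paragraph>Text ...</paragraph></chapter>

This is purely a mechanical check of the markup against the DTD; as noted below, it says nothing about whether the text itself has been tagged sensibly.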
In many commercial SGML- and XML-aware software packages, such a validating parser is included as standard and is often very closely integrated with the relevant tools (e.g. to ensure that any simple editing operations, such as cut and paste, do not result in the document failing to conform to the rules set out in the DTD because markup has been inserted or removed inappropriately). It is also possible to find freeware and public domain software which has some understanding of the markup rules expressed in the DTD, while also allowing users to validate their documents with a separate parser in order to guarantee conformance. Your choice will probably be dictated by the kind of software you currently use (e.g. in the case of editors, Windows-based office-type applications or Unix-style plain text editors), the budget you have available, and the files with which you will be working. Whatever your decision, it is important to remember that a parser can only validate markup against the declarations in a DTD; it cannot pick up semantic errors (e.g. incorrectly tagging a person's name as, say, a place name, or an epigraph as if it were a subtitle). So for the purposes of creating valid SGML documents, we have seen that there are a number of tools which you may wish to consider. If you already have files in electronic form, you will need to investigate translation or auto-tagging software, and if you have a great many files of the same type, you will probably want software which supports batch processing rather than anything which requires you to work on one file at a time. If you are creating SGML documents from scratch, or cleaning up the output of a conversion process, you will need some sort of editor (ideally one that is SGML-aware), and if your editor does not incorporate a parser, you will need to obtain one that can be run as a stand-alone application (there are one or two exceptionally good parsers freely available in the public domain). For an idea of the range of SGML and XML tools available, readers should consult Steve Pepper's The Whirlwind Guide to SGML & XML Tools and Vendors (http://www.infotek.no/sgmltool/guide.htm). Producing valid SGML files which conform to a DTD is, in some respects, only the first stage in any project. If you want to search the files for particular words, phrases, or marked-up features, you may prefer to use an SGML-aware search engine, but some people are perfectly happy writing short scripts in a language like Perl. If you want to conduct sophisticated computer-assisted text analysis of your material, you will almost certainly need to look at adapting an existing tool, or writing your own code. Having obtained your SGML text, whether as complete documents or as fragments resulting from a search, you will need to find some way of displaying it. You might choose simply to convert the SGML markup in the data into another format (e.g. HTML for display in a conventional web browser), or you might use one of the specialist SGML viewing packages to publish the results (which is how many commercial SGML-based electronic texts are produced). We do not have sufficient space to consider all the various alternatives in this publication, but once again you can get an idea of the options available by looking at The Whirlwind Guide to SGML & XML Tools and Vendors (http://www.infotek.no/sgmltool/guide.htm) or, more generally, The SGML/XML Web Page (http://www.oasis-open.org/cover/).
5.1.4: XML: The Future for SGML

As we saw in the previous section, an SGML-based markup language usually offers a number of advantages over other types of markup scheme, especially those which rely upon proprietary encoding. However, although SGML has met with considerable success in certain areas of publishing and many commercial, industrial, and governmental sectors, its uptake by the academic community has been relatively limited (with the notable exception of the Text Encoding Initiative, see 5.2: The Text Encoding Initiative and TEI Guidelines, below). We can speculate on why this might be so: for example, SGML has an undeserved reputation for being difficult and expensive to produce, because it supposedly imposes prohibitive intellectual overheads and because the necessary software is lacking (at least at prices academics can afford). While it is true that performing a thorough document analysis and developing a suitable DTD should not be undertaken lightly, it could be argued that to approach the production of any electronic text without first investing such intellectual resources is likely to lead to difficulties (either in the usefulness or the long-term viability of the resulting resource). The apparent lack of readily available, easy-to-use SGML software is perhaps a more valid criticism, yet the resources have been available for those willing to look, and then invest the time necessary to learn a new package (although freely available software tends to put more of an onus on the user than some of the commercial products). However, it is undoubtedly true that writing a piece of SGML software which fully implements the SGML standard (e.g. a validating SGML parser) is an extremely demanding task, and this has been reflected in the price and sophistication of some commercial applications.

Whilst SGML is probably more ubiquitous than many people realise, HTML, the markup language of the World Wide Web, is much better known. Nowadays, the notion of 'the Web' is effectively synonymous with the global Internet, and HTML plays a fundamental role in the delivery and presentation of information over the Web. The main advantage of HTML is that it is a fixed set of markup tags designed to support the creation of straightforward hypertext documents. It is easy to learn and easy for developers to implement in their software (e.g. HTML editors and browsers), and the combination of these factors has played a large part in the rapid growth and widespread acceptance of the Web. There is so much information about HTML already available that there is little to be gained from going into much detail here; readers who wish to know more should visit the W3C's HyperText Markup Language Home Page (http://www.w3.org/MarkUp/). Although HTML was not originally designed as an application of SGML, it soon became one once the designers realised the benefits to be gained from having a DTD (e.g. a validating parser could be used to ensure that markup had been used correctly, and so the resulting files would be easier for browsers to process). However, this meant that the HTML DTD had to be written retrospectively, and in such a way that any existing HTML documents would still conform to the DTD, which in turn meant that the value of the DTD was effectively diminished! This situation led to the release of a succession of different versions of HTML, each with their own slightly different DTD.
Nowadays, the most widely accepted release of HTML is probably version 3.2, although the World Wide Web Consortium (W3C) released HTML 4.0 on 18th December 1997 in order to address a number of outstanding concerns about the HTML standard. Future versions of HTML are probably unlikely, although there is work going on within the HTML committees of the W3C to take into account other developments within the W3C, and this has led to proposals such as the XHTML 1.0 Proposed Recommendation document released on 24th August 1999 (see http://www.w3.org/TR/1999/PR-xhtml1-19990824/). It is perfectly possible to deliver SGML documents over the Web, but there are several ways that this can be achieved and each has different implications. In order to retain the full 'added value' of the SGML markup, you might choose to deliver the raw SGML data over the Web and rely upon a behind-the-scenes negotiation between your web server and the client's browser to ensure that an appropriate SGML-viewing tool is launched on the client's machine. This enables the end-user to exploit fully the SGML markup included in your document, provided that s/he has been able to obtain and install the appropriate software. Another possibility would be to offer a Web-to-SGML interface on your server, so that end-users can access your documents using an ordinary Web browser whilst all the processing of the SGML markup takes place on the server, and the results are delivered as HTML. Alternatively, you might decide simply to convert the markup into HTML from whatever SGML DTD has been used to encode the document (either on-the-fly, or as part of a batch process), so that the end-user can use an ordinary Web browser and the server will not have to undertake any additional processing. The last of these options, while placing the least demands on the end-user, effectively involves throwing away all the extra intellectual information that is represented by the SGML encoding; for example, if in your original SGML document proper nouns, place names, foreign words, and certain types of emphasis have each been encoded with different markup according to your SGML DTD, they may all be translated to <EM> tags in HTML, and thus any automatically identifiable distinction between these different types of content will probably have been lost. The first option retains the advantages of using SGML, whilst placing a significant onus on the end-user to configure his or her Web browser correctly to launch supporting applications. The second option represents a middle way: exploiting the SGML markup whilst delivering easy-to-use HTML, but with the disadvantage of having to do much more sophisticated processing at the Web server. Until recently, therefore, those who create and deliver electronic text were confronted with a dilemma: to use their own SGML DTD, with all the additional processing overheads that entails, or to use an HTML DTD and suffer a diminution of intellectual rigour and descriptive power? Extending HTML was not an option for individuals and projects, because the developers of Web tools were only interested in supporting the flavours of HTML endorsed by the W3C. Meanwhile, delivering electronic text marked up according to another SGML DTD meant that end-users were obliged to obtain suitable SGML-aware tools, and very few of them seemed willing to do this.
One possible solution to this dilemma is the Extensible Markup Language (XML) 1.0 (see http://www.w3.org/TR/REC-xml), which became a W3C Recommendation (the nearest thing to a formal standard) on 10th February 1998. The creators of XML adopted the following design goals:

1. XML shall be straightforwardly usable over the Internet.
2. XML shall support a wide variety of applications.
3. XML shall be compatible with SGML.
4. It shall be easy to write programs which process XML documents.
5. The number of optional features in XML is to be kept to the absolute minimum, ideally zero.
6. XML documents should be human-legible and reasonably clear.
7. The XML design should be prepared quickly.
8. The design of XML shall be formal and concise.
9. XML documents shall be easy to create.
10. Terseness in XML markup is of minimal importance.

They sought to gain the generic advantages offered by supporting arbitrary SGML DTDs, whilst retaining much of the operational simplicity of using HTML. To this end, they 'threw away' all the optional features of the SGML standard which make it difficult (and therefore expensive) to process. At the same time they retained the ability for users to write their own DTDs, so that they can develop markup schemes which are tailored to suit particular applications but which are still enforceable by a validating parser. Perhaps most importantly of all, the committee which designed XML had representatives from several major companies which develop software applications for use with the Web, particularly browsers, and this has helped to encourage a great deal of interest in XML's potential. SGML has its roots in a time when creating, storing, and processing information on computer was expensive and time-consuming. Many of the optional features supported by the SGML standard were intended to make it cheaper to create and store SGML-conformant documents in an era when it was envisaged that all the markup would be laboriously inserted by hand, and megabytes of disk space were extremely expensive. Nowadays, faster and cheaper processors, and the falling costs of storage media (both magnetic and optical), mean that the designers and users of applications are less worried about the concerns of SGML's original designers. On the other hand, the ever-growing volume of electronic information makes it all the more important that any markup which has been used has been applied in a thoroughly consistent and easy-to-process manner, thereby helping to ensure that today's applications perform satisfactorily. XML addresses these familiar concerns, whilst taking advantage of modern computer systems and the lessons learned from using SGML. For example, now that the cost of storing data is of less concern to most users (except for those dealing with extremely large quantities of data), there is no need to offer support for markup short-cuts which, while saving storage space, tend to impose an additional load when information is processed. Instead, XML's designers were able to build in the concept of 'well-formed' data, which requires that any marked-up data are explicitly bounded by start- and end-tags, and that all the tagged data in a document nest appropriately (so that it becomes possible, say, to generate a document tree which captures the hierarchical arrangement of all the data elements in the document).
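As a purely illustrative sketch of what this requirement means in practice (the element names are invented, not taken from any published DTD), the first fragment below is well-formed XML: every element has an explicit start- and end-tag, and the elements nest inside one another without overlapping. The second shows the same document made a candidate for validation, simply by adding a document type declaration containing a small internal DTD against which a validating parser can check the markup (the notion of 'valid' XML is taken up below).

   <poem>
     <title>Upon Westminster Bridge</title>
     <stanza>
       <line>Earth has not anything to show more fair:</line>
     </stanza>
   </poem>

   <?xml version="1.0"?>
   <!DOCTYPE poem [
     <!ELEMENT poem   (title?, stanza+)>
     <!ELEMENT title  (#PCDATA)>
     <!ELEMENT stanza (line+)>
     <!ELEMENT line   (#PCDATA)>
   ]>
   <poem>
     <title>Upon Westminster Bridge</title>
     <stanza>
       <line>Earth has not anything to show more fair:</line>
     </stanza>
   </poem>

By contrast, a fragment such as <stanza><line> ... </stanza></line>, in which the tags overlap rather than nest, is not well-formed and would be rejected by any XML processor.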
Well-formedness has the added advantage that when two applications (such as a database and a web server) need to exchange data, they can use well-formed XML as their interchange format, because both the sending and receiving application can be certain that any data they receive will be appropriately marked up and there can be no possible ambiguity about where particular data structures start and end. XML takes this approach one stage further by adopting the SGML concept of DTDs, such that an XML document is said to be 'valid' if it has an associated DTD and the markup used in the document has been checked (by a validating parser) against the declarations expressed in that DTD. If an application knows that it will be handling valid XML, and has an understanding of and access to the relevant DTD, this can greatly improve its ability to process that data: for example, a search and retrieval application would be able to construct a list of all the marked-up data structures in the document, so that a user could refine the search criteria accordingly. Knowing that a vast collection of XML documents have all been validated against a particular DTD will greatly assist the processing of that collection, as valid XML data is also necessarily well-formed. By contrast, while it is possible for an XML application to process a well-formed document such that it can derive one possible DTD which could represent the data structures it contains, that DTD may not be sufficient to represent all the well-formed XML documents of the same type. There are clearly many advantages to be gained from creating and using valid XML data, but the option remains to use well-formed XML data in those situations where it would be appropriate. Today's Web browsers expect to receive conformant HTML data, and any additional markup included in the data which is not recognised by the browser is usually ignored. The next generation of Web browsers will know how to handle XML data, and while all of them will know how to process HTML data by default, they will also be prepared to cope with any well-formed or valid XML data that they receive. This offers the opportunity for groups of users to come together, agree upon a DTD they wish to adopt, and then create and exchange valid XML data which conforms to that DTD. Thus, a group of academics concerned with the creation of electronic scholarly editions of major texts could all agree to prepare their data in accordance with a particular DTD which enabled them to mark up the features of the texts which they felt to be appropriate for their work. They could then exchange the results of their labours safe in the knowledge that the data could be correctly processed by their favourite software (whether browsers, editors, text analysis tools, or whatever). Readers who wish to explore the similarities and differences between SGML and XML are advised to consult the sources mentioned on Robin Cover's The SGML/XML Web Page (http://www.oasis-open.org/cover/). Projects which have invested heavily in the creation of SGML-conformant resources are well placed to take advantage of XML developments, because any conversions that are required should be straightforward to implement. However, it is important to bear in mind that at the moment XML is just one of a suite of emerging standards, and it may be a little while yet before the situation becomes completely clear.
For example, the Extensible Stylesheet Language (XSL) Specification (http://www.w3.org/TR/WD-xsl/) for expressing stylesheets as XML documents is still under development, as are the proposals to develop XML Schema (http://www.w3.org/TR/xmlschema-1/), which may ultimately replace the role of DTDs when creating XML documents (and provide support not just for declaring data structures, but also for strong data typing, such that it would be possible to ensure, say, that the contents of a <DATE> element conformed to a particular international standard date format).

5.2: The Text Encoding Initiative and TEI Guidelines

5.2.1: A brief history of the TEI

(Much of the following text is extracted from publicly available TEI documents, and is reproduced here with minor amendments and the permission of the TEI Editors.)

The TEI began with a planning conference convened by the Association for Computers and the Humanities (ACH), gathering together over thirty experts in the field of electronic texts, representing professional societies, research centers, and text and data archives. The planning conference was funded by the U.S. National Endowment for the Humanities (NEH, an independent federal agency) and took place at Vassar College, Poughkeepsie, New York, on 12-13 November 1987. Those attending the conference agreed that there was a pressing need for a common text encoding scheme that researchers could use when creating electronic texts, to replace the existing system in which every text provider and every software developer had to invent and support their own scheme (since existing schemes were typically ad hoc constructs with support for the particular interests of their creators, but not built for general use). At a similar conference ten years earlier, one participant pointed out, everyone had agreed that a common encoding scheme was desirable, and predicted chaos if one was not developed. At the Poughkeepsie meeting, no one predicted chaos: everyone agreed that chaos had already arrived. After two days of intense discussion, the participants in the meeting reached agreement on the desirability and feasibility of creating a common encoding scheme for use both in creating new documents and in exchanging existing documents among text and data archives; the closing statement, the Poughkeepsie Principles (see http://www-tei.uic.edu/orgs/tei/info/pcp1.html), enunciated precepts to guide the creation of such a scheme. After the planning conference, the task of developing an encoding scheme for use in creating electronic texts for research was undertaken by three sponsoring organisations: the Association for Computers and the Humanities (ACH), the Association for Computational Linguistics (ACL), and the Association for Literary and Linguistic Computing (ALLC). Each sponsoring organisation named representatives to a Steering Committee, which was responsible for the overall direction of the project. Furthermore, a number of other interested professional societies were involved in the project as participating organisations, and each of these named a representative to the TEI Advisory Board. With support from the NEH, and later from the Commission of the European Communities and the Andrew W. Mellon Foundation, the TEI began the task of developing a draft set of Guidelines for Electronic Text Encoding and Interchange.
Working committees, comprising scholars from all over North America and Europe, drafted recommendations on various aspects of the problem, which were integrated into a first public draft (document TEI P1), published for public comment in June 1990. After the publication of the first draft, work began immediately on its revision. Fifteen or so specialised work groups were assigned to refine the contents of TEI P1 and to extend it to areas not yet covered. So much work was produced that a bottleneck ensued in getting it ready for publication, and the second draft of the Guidelines (TEI P2) was released chapter by chapter from April 1992 through November 1993. During 1993, all published chapters were revised yet again, some other necessary materials were added, and the development phase of the TEI came to its conclusion with the publication of the first 'official' version of the Guidelines (the first one not labelled a draft) in May 1994 (Sperberg-McQueen and Burnard 1994). Since that time, the TEI has concentrated on making the Guidelines (TEI P3) more accessible to users, on teaching workshops and training users, and on preparing ancillary material such as tutorials and introductions.

5.2.2: The TEI Guidelines and TEI Lite

The goals outlined in the Poughkeepsie Principles (see http://www-tei.uic.edu/orgs/tei/info/pcp1.html) were elaborated and interpreted in a series of design documents, which recommended that the Guidelines should:

- suffice to represent the textual features needed for research
- be simple, clear, and concrete
- be easy for researchers to use without special purpose software
- allow the rigorous definition and efficient processing of texts
- provide for user-defined extensions
- conform to existing and emergent standards

As the product of many leading members of the research community, it is perhaps not surprising that research needs are the prime focus of the TEI's Guidelines. The TEI established a plethora of work groups covering everything from 'Character Sets' and 'Manuscripts and Codicology' to 'Historical Studies' and 'Machine Lexica', in order to ensure that the interests of the various sectors of the arts and humanities research community were adequately represented. As one of the co-editors of the Guidelines, Michael Sperberg-McQueen, wrote: 'Research work requires above all the ability to define rigorously (i.e. precisely, unambiguously, and completely) both the textual objects being encoded and the operations to be performed upon them. Only a rigorous scheme can achieve the generality required for research, while at the same time making possible extensive automation of many text-management tasks.' (Sperberg-McQueen and Burnard 1995, 18). As we saw in the previous section (5.1 The Standard Generalized Markup Language), SGML offers all the necessary techniques to define and enforce a formal grammar, and so it was chosen as the basis for the TEI's encoding scheme.

The designers of the TEI also had to decide how to reconcile the need to represent the textual features required by researchers with their other expressed intention of keeping the design simple, clear, and concrete. They concluded that rather than have many different SGML DTDs (i.e. one for each area of research), they would develop a single DTD with sufficient flexibility to meet a range of scholars' needs. They began by resolving that, wherever possible, the number of markup elements should not proliferate unnecessarily
(e.g. have a single <NOTE> tag with a TYPE attribute to say whether it was a footnote, endnote, shouldernote, etc., rather than having separate <FOOTNOTE>, <ENDNOTE>, and <SHOULDERNOTE> tags). Yet as this would still result in a large and complex DTD, they also decided to implement a modular design, grouping sets of markup tags according to particular descriptive functions, so that scholars could choose to mix and match as many or as few of these markup tags as they required. Lastly, in order to meet the needs of those scholars whose markup requirements could not be met by this comprehensive DTD, they designed it in such a way that the DTD could be adapted or extended in a standard fashion, thereby allowing these scholars to operate within the TEI framework and retain the right to claim compliance with the TEI's Guidelines. There is no doubt that the TEI's DTD and Guidelines can appear rather daunting at first, especially if one is unfamiliar with descriptive markup, text encoding issues, or SGML/XML applications. However, for anyone seriously concerned about creating an electronic textual resource which will remain viable and usable in the 'long term' (which can be less than a decade in the rapidly changing world of information technology), the TEI's approach certainly merits very serious investigation, and you should think very carefully before deciding to reject the TEI's methods in favour of another apparent solution. The importance of the modularity and extensibility of the TEI's DTD cannot be overstated. In order to make their design philosophy more accessible to new users of text encoding and SGML/XML, the creators of the TEI's DTD have developed what they describe as the 'Chicago pizza model' of DTD construction. Every Chicago (indeed, U.S.) pizza must have certain ingredients in common, namely cheese and tomato sauce; pizza bases can be selected from a pre-determined, limited range of types (e.g. thin-crust, deep-dish, or stuffed), whilst pizza toppings may vary considerably (from a range of well-known ingredients, through to local specialities or idiosyncratic preferences!). In the same way, every implementation of the TEI DTD must have certain standard components (e.g. header information and the core tag set) and one of the eight base tag sets (see below), to which can then be added any combination of the additional tag sets or user-defined, application-specific extensions. TEI headers are discussed in more detail in Chapter 6, whilst the core tag set consists of common elements which are not specific to particular types of text or research application (e.g. the <P> tag used to identify paragraphs). Of the eight base tag sets, six are designed for use with texts of one particular type (i.e. prose, verse, drama, transcriptions of spoken material, printed dictionaries, and terminological data), whilst the other two (general and mixed) allow for anthologies or unrestricted mixing of the other base types. The additional tag sets (the pizza toppings) provide the necessary markup tags for describing such things as hypertext linking, the transcription of primary sources (especially manuscripts), critical apparatus, names and dates, language corpora, and so on. Readers who wish to know more should consult the full version of the Guidelines, which are also available online at http://www.hcu.ox.ac.uk/TEI/P4beta/.
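To give a flavour of how this works in practice, here is a brief sketch (the sentences being encoded are invented, but the <p> and <note> elements and the type attribute are drawn from the TEI scheme itself): a footnote and an endnote are encoded with the same element, distinguished only by an attribute value.

   <p>The expedition reached the coast in March<note type="footnote" n="1">The
   date is disputed; some sources give April.</note> and then turned south.</p>

   <p>The second volume appeared posthumously.<note type="endnote" n="12">See
   the bibliography for full details.</note></p>

Software processing the text can then treat all notes alike, or distinguish them by the value of their type attribute, without any proliferation of element names.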
Even the brief description given above is probably enough to indicate that while the TEI scheme offers immense descriptive possibilities, its application is not something to be undertaken lightly. With this in mind, the designers of the TEI DTD developed a couple of pre-built versions of the DTD, of which the best known and most widely used is called 'TEI Lite'. Each aspect of the TEI Lite DTD is documented in TEI Lite: An Introduction to Text Encoding for Interchange (Burnard and Sperberg-McQueen 1995), which is also available online at http://www.hcu.ox.ac.uk/TEI/Lite/. The abstract of this document states that TEI Lite 'can be used to encode a wide variety of commonly encountered textual features, in such a way as to maximize the usability of electronic transcriptions and to facilitate their interchange among scholars using different types of computer systems'. Indeed, many people find that the TEI Lite DTD is more than adequate for their purposes, but even for those who do need to use the other tag sets available in the full TEI DTD, TEI Lite provides a valuable introduction to the TEI's encoding scheme. Several people involved in the development and maintenance of the TEI DTD have continued to investigate ways to facilitate its use, such as the 'Pizza Chef' (available at http://www.hcu.ox.ac.uk/TEI/newpizza.html), which offers a web-based method of combining the various tag sets to make your own TEI DTD, and an XML version of TEI Lite (see the TEI Consortium Homepage (http://www.tei-c.org/)). It can only be hoped that as more people appreciate the merits of adopting the TEI scheme, the number of freely available SGML/XML TEI-aware tools and applications will continue to grow.

5.3: Where to find out more about SGML/XML and the TEI

Although SGML was released as an ISO standard in 1986, its usage has grown steadily rather than explosively, and uptake has tended to occur within the documentation departments of major corporations, government departments, and global industries. This is in dramatic contrast to XML, which was released as a W3C Recommendation in 1998 but was able to build on the tremendous level of international awareness about the web and HTML (and, to some extent, on the success of SGML in numerous corporate sectors). As a very simple indicator, on 20th August 1999 the online catalogue of amazon.co.uk (http://amazon.co.uk) listed only 28 books with 'SGML' in the title, as compared to 68 which mention 'XML' (and 5 of these are common to both!). One of the best places to find out more about both the SGML and XML standards, their application, relevant websites, discussion lists and newsgroups, journal articles, conferences, and the like, is Robin Cover's excellent The SGML/XML Web Page (http://www.oasis-open.org/cover/). It would be pointless to reproduce a selection of Cover's many references here (as they would rapidly go out of date), but readers are strongly urged to visit this website and use it to identify the most relevant information sources. However, it is also important to remember that XML (like SGML) is only one amongst a family of related standards, and that these XML-related standards are developing and changing very rapidly, so you should remember to visit these sites regularly, or risk making the wrong decisions on the basis of out-dated information. Keeping up-to-date with the Text Encoding Initiative is a much more straightforward matter.
The website of the TEI Consortium (http://www.tei-c.org/) provides the best starting point for accessing other TEI-related online resources, whilst the TEI-L@LISTSERV.UIC.EDU discussion list is an active forum for anyone interested in using the TEI's Guidelines and provides an extremely valuable source of advice and support.

Chapter 6: Documentation and Metadata

6.1 What is Metadata and why is it important?

Simply put, metadata is one piece of data which describes another piece of data. In the context of digital resources, the kind of information you would expect to find in a typical metadata record would be data on the nature of a resource, who created the resource, what format it is held in, where it is held, and so on. In recent years the issue of metadata has become a serious topic for those concerned with the creation and management of digital resources. When digital resources first started to emerge, much of the focus of activity was centred on the creation process, without much thought given to how these resources would be documented and found by others. In the academic arena, announcements of the availability of resources tended to be made within an interested community, usually through subject-based discussion lists. However, as use of the web has steadily increased, many institutions have come to depend on it as a crucial means of storing and distributing information. The means by which this information is organised has now become a central issue if the web is to continue to be an effective tool for the digital information age. While there is an overwhelming consensus that a practical metadata model is required, a single one has yet to emerge which will satisfy the needs of the net community as a whole. This section of the Guide will look at two metadata models currently in use, the Dublin Core Element Set and the TEI Header, but we begin with an overview of the problem as it stands at the moment. The concept of metadata has been around much longer than the web, and while there exist a great number of metadata formats, it is most often associated with the work of the library community. The web is commonly likened to an enormous library for the digital age, and while this analogy may not stand up to any serious scrutiny, it is a useful one to make as it highlights the perceived problems associated with metadata and digital resources and points towards possible solutions. At its inception the web was neither designed nor intended as a forum for the organised publication and retrieval of information, and therefore no system for effectively cataloguing information held on the web was devised. Due to this lack of formal cataloguing procedures, the web has evolved into a 'chaotic repository for the collective output of the world's digital "printing presses"' (Lynch 1997). Locating an item on a library shelf is a relatively simple task due to our familiarity with a long-established procedure for doing so. Library metadata systems, such as MARC, follow a strictly defined set of rules which are applied by a trained body of professionals. The web has few such parallels. One of the most common ways of locating items on the web is via a search engine, and it is to these that the proper application of metadata would be most beneficial. While search engines are undeniably powerful, they do not operate in an effective and precise enough way to make them trustworthy tools of retrieval.
It is estimated that there are in the region of three and a half million web sites containing five hundred million unique files (OCLC Web Characterisation Project, June 1999, http://www.oclc.org/oclc/research/projects/webstats), only one-third of which are indexed by search engines. The web contains much that is difficult to catalogue in a straightforward manner (multimedia packages, audio and visual material, not to mention pages which are automatically generated), and all of it demands consideration in any system which attempts to catalogue it. The method by which search engines index a web site is based on the frequency of occurrences of words which appear in the document, rather than on identifying any real notion of its content. The indiscriminate nature of the searches not only makes it difficult to find what you are looking for but often buries any potentially useful information in a flurry of unwanted, unrelated 'hits'. The growing commercialisation of the web has influenced the nature of search engines and made them even more unreliable and of dwindling practical use to the academic community. While search engines are now able to make better use of the HTML <META> tag (although the tag can be open to abuse by index spamming), it is perhaps a case of too little too late. Initiatives such as the Dublin Core go some way in trying to redress the balance, but these are still being refined and have numerous shortcomings. The Dublin Core, in an attempt to maintain its simplicity, fails to achieve its hoped-for functionality, trading off much of its potential precision in a quest for general acceptance. The Dublin Core element set is, in places, too general to describe coherently the complex relationships which exist within many digital resources, and lacks the required rigidity, in areas such as the use of controlled vocabulary, to make it easily interoperable. This applies particularly to the original unqualified 15 elements, but the work of bodies such as the Dublin Core Data Model working group, implementing Dublin Core in RDF/XML, is providing potential solutions to these problems (http://www.ukoln.ac.uk/metadata/resources/dc/datamodel/WD-dc-rdf/). While a single metadata scheme, adopted and implemented wholesale, would be the ideal, it is probable that a proliferation of metadata schemes will emerge and be used by different communities. This makes the current work centred on integrated services and interoperability all the more important.

6.1.1: Conclusion and current developments

A solution to the problem of how to document data on the web, so that they can be located and retrieved with the minimum of effort, is now essential if the web is to continue to thrive as a major provider of our daily resources. It is generally recognised that what is required is a metadata scheme which contains 'the librarian's classification and selection skills complemented by the computer scientist's ability to automate the task of indexing and storing information' (Lynch 1997). Existing models do not go far enough in providing a framework that satisfies the precise requirements of different communities and discipline groups, and until clear guidelines become available on how metadata records should be created in a standardised way, little progress will be made. In the foreseeable future it is unlikely that some outside agent will prepare your metadata for you, and proper investment in web cataloguing methods is therefore essential if its implementation is to be executed successfully.
New developments and proposals are being investigated in an attempt to find solutions in the face of these seemingly insurmountable problems. The Warwick Framework (http://www.ukoln.ac.uk/metadata/resources/wf.html), for example, suggests the concept of a container architecture, which can support the coexistence of several independently developed and maintained metadata packages which may serve other functions (rights management, administrative metadata, etc.). Rather than attempt to provide a metadata scheme for all web resources, the Warwick Framework uses the Dublin Core as a starting point, but allows individual communities to extend this to fit their own subject-specific requirements. This movement towards a more decentralised, modular, and community-based solution, where the 'communities of expertise' themselves create the metadata they need, has much to offer. In the UK, various funded organisations such as the AHDS (http://ahds.ac.uk/), and projects like ROADS (http://www.ilrt.bris.ac.uk/roads/) and DESIRE (http://www.desire.org/), are all involved in assisting the development of subject-based information gateways that provide metadata-based services tailored to the needs of particular user communities. It is clear that there is still some way to go before the problems of metadata for describing digital resources have been adequately resolved. Initiatives created to investigate the issues are still in their infancy, but hopefully solutions will be found, either globally or within distinct communities, which will provide a framework simple enough to be used by the maximum number of people with the minimum degree of inconvenience.

6.2 The TEI Header

The work and objectives of the Text Encoding Initiative (TEI) and the guidelines it produced for text encoding and interchange have already been discussed in Chapter 5. In this section dealing with metadata, we will focus on how the TEI has approached the problems particular to the effective documentation of electronic texts. This section will look at the TEI Header and, specifically, the version of the header as provided by the TEI Lite DTD (http://www.hcu.ox.ac.uk/TEI/Lite/). Unlike the Dublin Core element set, the TEI Header is not designed specifically for describing and locating objects on the web, although it can be used for this purpose. The TEI Header provides a mechanism for fully documenting all aspects of an electronic text. The TEI Header does not limit itself to documenting the text only, but also provides a system for documenting its source, its encoding practices, and the process of its creation. The TEI Header is therefore an essential source of information for users of the text, for software that has to process the metadata information, and for cataloguers in libraries, museums, and archives. In contrast with the Dublin Core, whose inclusion in any document is voluntary, the presence of the TEI Header is mandatory if the document is to be considered TEI conformant. As with the full TEI Lite tag set, a number of optional elements are offered by the TEI Header (of which only one, the <fileDesc>, is mandatory) for use in a structured way. These elements can be extended by the addition of attributes. Therefore the TEI Header can range from a very large and complex document to a simple, concise piece of metadata.
The most basic valid TEI Lite header would look something like:

   <teiHeader>
     <fileDesc>
       <titleStmt>
         <title>A guide to good practice</title>
       </titleStmt>
       <publicationStmt>
         <p>Published by the AHDS, 1999</p>
       </publicationStmt>
       <sourceDesc>
         <bibl>A dual web and print publication</bibl>
       </sourceDesc>
     </fileDesc>
   </teiHeader>

At its simplest, a TEI Lite header requires no more than a description of the electronic file itself, a description which includes some kind of statement on what the text is called, how it is published, and whether it has been derived or transcribed from another source. A typical TEI Header would hopefully contain more detailed information relating to a document. In general, the header should be regarded as providing the same kind of information as that provided by the title page of a printed book, combined with the information usually found in an electronic 'readme' file. As with the Dublin Core <META> tag, the TEI Header appears at the beginning of a text (although it can be held separately from the document), between the SGML prolog (i.e. the SGML declaration and the DTD) and the front matter of the text itself:

   <!DOCTYPE tei.2 PUBLIC "-//TEI//DTD TEI Lite 1.6//EN">
   <tei.2>
   <teiHeader>
     [header details go here]
   </teiHeader>
   <text>
     <front> ... </front>
     <body> ...

The metadata information contained within the TEI Header can also be utilised as an effective resource for the information management of texts. In the same way that an online library catalogue allows different search options and views of a collection, the metadata information in the TEI Header can also be manipulated to present different access points into a collection of electronic texts. For example, rather than maintain a separate, static catalogue or database, the holdings of the OTA as recorded in the metadata information stored in the TEI Headers are used to assist in the identification and retrieval of resources. In addition to being able to perform simple searches for the author or title of a work, users of the OTA catalogue can submit complex queries on a number of available options, such as searching for resources by language, genre, time period, and even by file format. In addition to its use in constructing indexes and catalogues dynamically, the metadata contained within the TEI Header can also be used to create other metadata and catalogue records. TEI Header metadata can be extracted and mapped onto other well-established resource cataloguing standards, such as library MARC records, or onto emerging standards such as the Dublin Core element set and the Resource Description Framework (RDF). This is a relatively simple task, since the TEI Header was closely modelled on existing standards in library cataloguing. For example, the TEI Lite <author> tag within the <titleStmt> is analogous to the MARC 100 (author) record field and also to the Dublin Core CREATOR element. There is no need, therefore, to maintain several different metadata formats when they can simply be filtered from one central information source. For more details see http://www.ukoln.ac.uk/metadata/interoperability/ and http://www.hcu.ox.ac.uk/ota/public/publications/metadata/giordano.sgm.

6.2.1: The TEI Lite Header Tag Set

Although the TEI Lite header has only one required element (the <fileDesc>), it is recommended that all four of the principal elements which comprise the header be used.
The TEI Header provides scope to describe practically all of the textual and non-textual aspects of an electronic text, so the recommendation when creating a header is to include as much information as possible. The following overview of the four main elements which go to make up the header is by no means exhaustive; a more comprehensive account, with examples, can be found in the Gentle Introduction to SGML (see http://www.hcu.ox.ac.uk/TEI/Lite/teiu5_en.htm). The four recommended elements which go to make up a <teiHeader> are:

<fileDesc>: the file description. This element contains a full bibliographic description of an electronic file.

<encodingDesc>: the encoding description. This element documents the relationship between an electronic text and the source(s) from which it was derived.

<profileDesc>: the profile description. This element provides a detailed description of the non-bibliographic aspects of a text, specifically the languages and sub-languages used, the situation in which it was produced, the participants and their setting.

<revisionDesc>: the revision description. This element summarises the revision history of a file.

The elements within the TEI Header fall into three broad categories of content:

- Descriptions (containing the suffix Desc) can contain simple prose descriptions of the content of the element. These can also contain specific sub-elements.
- Statements (containing the suffix Stmt) indicate that the element groups together a number of specialised elements recording some structured information.
- Declarations (containing the suffix Decl) enclose information about specific encoding practices applied to the electronic text.

The file description: <fileDesc>

The file description contains a full bibliographic description of the computer file itself. It should provide enough useful information in itself to construct a meaningful bibliographic citation or library catalogue entry. The <fileDesc> contains three mandatory and four optional elements:

<titleStmt>: groups information relating to the title of the work and those responsible for its intellectual content. Details of any major funding or sponsoring bodies can also be recorded here. This element is mandatory.

<editionStmt>: groups together information relating to one edition of a text. This element may contain information on the edition or version of the electronic work being documented.

<extent>: simply records the size of the electronic text in a recognisable format, e.g. bytes, Mb, words, etc.

<publicationStmt>: records the publication or distribution details of the electronic text, including a statement on its availability status (e.g. freely available, restricted, forbidden, etc.). This element is mandatory. An <idno> is also included to provide a useful mechanism for identifying a bibliographic item by assigning it one or more unique identifiers.

<seriesStmt>: groups together information about a series, if any, to which a publication belongs. Again, an <idno> element is supplied to help with identifying the unique individual work.

<noteStmt>: groups together any notes providing information about a text additional to that recorded in other parts of the bibliographic description. This general element can be made use of in a variety of ways to record potentially significant details about the text and its features which have not already been accommodated elsewhere in the header.

<sourceDesc>: groups together details of the source or sources from which the electronic edition was derived.
This element may contain a simple prose description of the text, or more complex bibliographic elements may be employed to provide a structured bibliographic reference for the work. This element is mandatory.

The encoding description: <encodingDesc>

<encodingDesc>: documents the relationship between an electronic text and the source or sources from which it was derived. The <encodingDesc> can contain a simple prose description detailing such features as the purpose(s) for which the work was encoded, as well as any other relevant information concerning the process by which it was assembled or collected. While there are no mandatory elements within the <encodingDesc>, those available are useful for documenting the rationale behind how and why certain elements have been implemented.

<projectDesc>: used to describe, in prose, the purpose for which the electronic text was encoded (for example if a text forms part of a larger collection, or was created with a particular audience in mind).

<samplingDecl>: useful in identifying the rationale behind the sampling procedure for a corpus.

<editorialDecl>: provides details of the editorial principles applied during the encoding of a text; for example, it can record whether the text has been normalised or how quotations in a text have been handled.

<tagsDecl>: groups information on how the SGML tags have been used, and how often, within a text.

<refsDecl>: commonly used to identify which SGML elements contain identifying information, and whether this information is represented as attribute values or as content.

<classDecl>: defines which descriptive classification schemes (if any) have been used by other parts of the header.

The profile description: <profileDesc>

<profileDesc>: details the non-bibliographic aspects of a text, specifically the languages used in the text, the situation in which the text was produced, and the participants involved in its creation.

<creation>: groups information detailing the time and place of creation of a text.

<langUsage>: records the languages (including dialects, sub-languages, etc.) used in the text.

<textClass>: describes the nature or topic of the text in terms of a standard classification scheme. Included in this element is a useful <keywords> tag which can be used to identify the classification scheme used and the keywords drawn from it.

The revision description: <revisionDesc>

<revisionDesc>: provides a detailed system for recording changes made to the text. This element is of particular use in the administration of files, recording when changes were made to a text and by whom. The <revisionDesc> should be updated every time a significant alteration has been made to a text.

6.2.2 The TEI Header: Conclusion

The above overview hopefully demonstrates the comprehensive nature of the TEI Header as a mechanism for documenting electronic texts. The emergence of the electronic text over the past decade has presented librarians and cataloguers with many new challenges. Existing library cataloguing procedures, while inadequate to document all the features of electronic texts properly, were used as a secure foundation onto which additional features directly relevant to the electronic text could be grafted.
Chapter Nine of AACR2 (the Anglo-American Cataloguing Rules) requires substantial updating and revision, as it assumes that all electronic texts are published through a publishing company and cannot adequately catalogue texts which are only published on the Internet. The TEI Header has proved to be an invaluable tool for those concerned with documenting electronic resources; its supremacy in this field can be measured by the increasing number of electronic text centres, libraries, and archives which have adopted its framework. The Oxford Text Archive has found it indispensable as a means of managing its large collection of disparate electronic texts, not only as a mechanism for creating its searchable catalogue, but as a means of creating other forms of metadata which can communicate with other information systems. Ironically, it is the same generality and flexibility offered by the TEI Guidelines (P3) on creating a header which have hindered the progress of one of the main goals of the TEI and the hopes of the electronic text community as a whole, namely the interoperability and interchangeability of metadata. Unlike the Dublin Core element set, which has a defined set of rules governing its content, the TEI Header has a set of guidelines, which allow for widely divergent approaches to header creation. While this is not a major problem for individual texts, or texts within a single collection, the variant ways in which the guidelines are interpreted and put into practice make easy interoperability with other systems using TEI Headers more difficult than first imagined. As with the Dublin Core element set, what is required is the wholesale adoption of a mutually acceptable code of practice which header creators could implement. One final aspect of the TEI Header is a cause of irritation to those creating and managing TEI Headers and texts: the apparent dearth of affordable and user-friendly software aimed specifically at header production. While this has long been a general criticism of SGML applications as a whole, and the TEI can in no way be held to blame for this absence (as it was not part of the TEI remit to create software), it has contributed to the relatively slow uptake and implementation of the TEI Header as the predominant method of providing well-structured metadata to the electronic text community as a whole. Until this situation is adequately resolved, the tools on offer will tend to be either freeware products designed by people within the SGML community itself, or large and very expensive purpose-built SGML-aware products aimed at the commercial market.

Further reading:

The SGML/XML Web Page (http://www.oasis-open.org/cover/sgml-xml.html)
Ebenezer's software suite for TEI (http://www.umanitoba.ca/faculties/arts/linguistics/russell/ebenezer.htm)
TEI home page (http://www.tei-c.org/)

6.3 The Dublin Core Element Set and the Arts and Humanities Data Service

'The Dublin Core is a 15-element metadata element set intended to facilitate discovery of electronic resources. Originally conceived for author-generated description of web resources, it has also attracted the attention of formal resource description communities such as museums and libraries' (Dublin Core Metadata home page, http://purl.oclc.org/metadata/dublin_core/).

By the mid-1990s, large-scale web users, document creators, and information providers had recognised the pressing need to introduce some kind of workable cataloguing scheme for documenting resources on the web.
The scheme needed to be accessible enough to be adopted and implemented by typical web content creators who had little or no formal cataloguing training. The set of metadata elements also needed to be simpler than those used in traditional library cataloguing systems, while offering a greater precision of retrieval than the relatively crude indexing methods employed by existing search engines and web crawlers. The Dublin Core Metadata Element Set grew out of a series of meetings and workshops involving experts from the library world, the networking and digital library research community, and other content specialists. The basic objectives of the Dublin Core initiative included:

- to produce a core set of descriptive elements which would be capable of describing or identifying the majority of resources available on the Internet. Unlike a traditional library, where the main focus is on cataloguing published textual materials, the Internet contains a vast range of material in a variety of formats, including non-textual material such as images or video, most of which has not been 'published' in any formal way;
- to make this scheme intelligible enough that it could be easily utilised by untrained cataloguers, but still retain enough content that it functioned effectively as a catalogue record;
- to encourage the adoption of the scheme on an international level, by ensuring that it provided the best format for documenting digital objects on the web.

The Dublin Core element set provides a straightforward framework for documenting features of a work such as who created the work, what its content is and what languages it contains, where and from whom it is available, and in what formats, and whether it derived from a printed source. At a basic level the element set uses commonly understood terms and semantics which are intelligible to most disciplines and information systems communities. The descriptive terms were chosen to be generic enough to be understood by a document author, but could also be extended to provide full and precise cataloguing information. For example, textual authors, painters, photographers, and writers of software programs can all be considered 'creators' in a broad sense. In any implementation of the Dublin Core element set, all elements are optional and repeatable. Therefore, if a work is the result of a collaboration between a number of contributors, it is relatively easy to record the details of each one (name, contact details, etc.) as well as their specific contribution or role (author, editor, photographer, etc.) by simply repeating the appropriate element. These basic details can be extended by the use of Dublin Core qualifiers. The Dublin Core initiative originally defined three different kinds of qualifier: type (or sub-element) to broadly refine the semantics of an element name, language to specify the language of an element value, and scheme to note the existence of an element value taken from an externally defined scheme or standard. Guidelines for implementing these qualifiers in HTML are also available. Work on integrating Dublin Core and the Resource Description Framework (RDF), however, revealed that these terms could be a source of confusion. Dublin Core qualifiers are now identified as either element qualifiers, which refine the semantics of a particular element, or value qualifiers, which provide contextual information about an element value. Take the Dublin Core date element, for example.
Element qualifiers would allow the broad concept of date to be subdivided into notions such as 'date of creation' or 'date of last modification'. Value qualifiers might explain how a particular element value should be parsed: for example, a date element with a value qualifier of 'ISO 8601' indicates that the string '1999-01-01' should be parsed as the 1st of January 1999. Other value qualifiers might indicate that an element value is taken from a particular controlled vocabulary or scheme, for example a subject term from an established scheme such as the Library of Congress Subject Headings.

6.3.1 Implementing the Dublin Core

The Dublin Core element set was designed for documenting web resources, and it is easily integrated into web pages using the HTML <META> tag, inserted between the <HEAD>...</HEAD> tags and before the <BODY> of the work (a short illustrative sketch appears at the end of this section). An Internet-Draft has been published that explains how this should be done (http://www.ietf.org/internet-drafts/draft-kunze-dchtml-02.txt). No specialist tools more sophisticated than an average word processor are required to produce the content of a Dublin Core record; however, a number of labour-saving devices are available, notably the DC-dot generator available from the UKOLN web site (http://www.ukoln.ac.uk/metadata/dcdot/). DC-dot can automatically generate Dublin Core metadata for a web site and encode this in HTML <META> tags and other formats. The metadata produced can also be easily edited and extended further. The Nordic Metadata Project Template is an alternative way of creating simple Dublin Core metadata that can be embedded in HTML <META> tags (http://www.lub.lu.se/cgi-bin/nmdc.pl).

6.3.2 Conclusions and further reading

The Dublin Core element scheme offers enormous potential as a usable, standard cataloguing procedure for digital resources on the web. The core set of elements is broad and encompassing enough to be of use to novice web authors and skilled cataloguers alike. However, its success will ultimately depend on its wide-scale adoption by the Internet community as a whole. It is also crucial that the rules of the scheme be implemented in an intelligent and systematic way. To fulfil this objective, more has to be done to refine and stabilise the element set, and the provision and use of simple Dublin Core generating tools, which demonstrate the benefits of including metadata, need to become more prevalent.

The Arts and Humanities Data Service (AHDS), in association with the UK Office for Library and Information Networking (UKOLN), has produced a publication which outlines in more detail the best practices involved in using Dublin Core, as well as giving many practical examples: Discovering Online Resources across the Humanities: A Practical Implementation of the Dublin Core (ISBN 0-9516856-4-3). This publication is also freely available from the AHDS web site (http://ahds.ac.uk/).

A practical illustration of how the Dublin Core element set can be implemented in order to perform searches for individual items across disparate collections is the AHDS Gateway (http://ahds.ac.uk:8080/ahds_live/). The AHDS Gateway is, in reality, an integrated catalogue of the holdings of the five individual Service Providers which make up the AHDS. Although the Service Providers are geographically separated, each provides Dublin Core records describing its holdings, so users can search across the complete holdings of the AHDS from a single access point.
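To make the embedding approach described in section 6.3.1 concrete, the sketch below shows how a handful of Dublin Core elements might be recorded in HTML 4.0 <META> tags within the <HEAD> of a web page. It is illustrative only: the page title, the transcriber's name, and the identifier URL are invented, and the use of the HTML scheme attribute to carry value qualifiers (ISO 8601 for the date, RFC 1766 for the language) is one possible approach; the Internet-Draft cited above should be consulted for the exact recommended encoding.

<HTML>
<HEAD>
<TITLE>King Lear: an electronic transcription</TITLE>
<!-- Dublin Core metadata embedded as HTML <META> tags.        -->
<!-- All values are invented and purely illustrative.          -->
<META NAME="DC.Title"       CONTENT="King Lear: an electronic transcription">
<META NAME="DC.Creator"     CONTENT="Shakespeare, William">
<!-- The transcriber made a secondary contribution, so is recorded as a
     CONTRIBUTOR rather than a CREATOR (see element 6 below).  -->
<META NAME="DC.Contributor" CONTENT="Bloggs, Joanna (transcriber)">
<!-- Value qualifiers: the SCHEME attribute notes the standard against
     which each value should be interpreted.                    -->
<META NAME="DC.Date"        SCHEME="ISO8601" CONTENT="1999-01-01">
<META NAME="DC.Language"    SCHEME="RFC1766" CONTENT="en">
<META NAME="DC.Format"      CONTENT="text/html">
<META NAME="DC.Identifier"  CONTENT="http://www.example.org/texts/lear.html">
</HEAD>
<BODY>
<!-- ... the text of the resource itself ... -->
</BODY>
</HTML>

A tool such as DC-dot will generate broadly similar <META> tags automatically, which can then be checked, edited, and extended by hand.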
6.3.3 The Dublin Core Elements

This set of official definitions of the Dublin Core metadata element set is based on: http://purl.oclc.org/metadata/dublin_core_elements

Element Descriptions

1. Title
Label: TITLE
The name given to the resource by the CREATOR or PUBLISHER. Where possible, standard authority files should be consulted when entering the content of this element. For example, the Library of Congress or British Library title lists can be used, but always remember to indicate the source using the 'scheme' qualifier. If authorities are used, they would need to be indicated as a value qualifier.

2. Author or Creator
Label: CREATOR
The person or organisation primarily responsible for creating the intellectual content of the resource; for example, authors in the case of written documents, or artists, photographers, and illustrators in the case of visual resources. Note that this element does not refer to the person who is responsible for digitizing a work; that belongs in the CONTRIBUTOR element. So in the case of a machine-readable version of King Lear held by the OTA, the CREATOR remains William Shakespeare, and not the person who transcribed it into digital form. Again, standard authority files should be consulted for the content of this element.

3. Subject and Keywords
Label: SUBJECT
The topic of the resource. Typically, subject will be expressed as keywords or phrases that describe the subject or content of the resource. The use of controlled vocabularies and formal classification schemes is encouraged.

4. Description
Label: DESCRIPTION
A textual description of the content of the resource, including abstracts in the case of document-like objects or content descriptions in the case of visual resources.

5. Publisher
Label: PUBLISHER
The entity responsible for making the resource available in its present form, such as a publishing house, a university department, or a corporate entity.

6. Other Contributor
Label: CONTRIBUTOR
A person or organisation not specified in a CREATOR element who has made significant intellectual contributions to the resource, but whose contribution is secondary to any person or organisation specified in a CREATOR element (for example, an editor, transcriber, or illustrator).

7. Date
Label: DATE
The date the resource was made available in its present form. Recommended best practice is an eight-digit number in the form YYYY-MM-DD, as defined in http://www.w3.org/TR/NOTE-datetime, a profile of ISO 8601. In this scheme, the date value 1994-11-05 corresponds to November 5, 1994. Many other schemes are possible but, if used, they should be identified in an unambiguous manner.

8. Resource Type
Label: TYPE
The category of the resource, such as home page, novel, poem, working paper, technical report, essay, or dictionary. For the sake of interoperability, TYPE should be selected from an enumerated list that is under development in the workshop series at the time of publication of this document. See http://sunsite.berkeley.edu/Metadata/types.html for current thinking on the application of this element.

9. Format
Label: FORMAT
The data format of the resource, used to identify the software (and possibly hardware) that might be needed to display or operate the resource. For the sake of interoperability, FORMAT should be selected from an enumerated list that is under development in the workshop series at the time of publication of this document.
10. Resource Identifier
Label: IDENTIFIER
A unique string or number used to identify the resource. Examples for networked resources include URLs and URNs (when implemented). Other globally unique identifiers, such as International Standard Book Numbers (ISBNs) or other formal names, would also be candidates for this element in the case of off-line resources.

11. Source
Label: SOURCE
A unique string or number used to identify the work from which this resource was derived, if applicable. For example, a PDF version of a novel might have a SOURCE element containing the ISBN of the physical book from which the PDF version was derived.

12. Language
Label: LANGUAGE
The language(s) of the intellectual content of the resource. Where practical, the content of this field should coincide with RFC 1766. See: http://info.internet.isi.edu/in-notes/rfc/files/rfc1766.txt

13. Relation
Label: RELATION
The relationship of this resource to other resources. The intent of this element is to provide a means of expressing relationships among resources that have formal relationships to others, but exist as discrete resources themselves; for example, images in a document, chapters in a book, or items in a collection. Formal specification of RELATION is currently under development. Users and developers should understand that use of this element is currently considered to be experimental.

14. Coverage
Label: COVERAGE
The spatial and/or temporal characteristics of the resource. Formal specification of COVERAGE is currently under development. Users and developers should understand that use of this element is currently considered to be experimental.

15. Rights Management
Label: RIGHTS
A link to a copyright notice, to a rights-management statement, or to a service that would provide information about terms of access to the resource. Formal specification of RIGHTS is currently under development. Users and developers should understand that use of this element is currently considered to be experimental.

Chapter 7: Summary

This final chapter is not intended to duplicate material contained elsewhere in this Guide. Instead, it outlines the ten major steps which make up an ideal electronic text creation project. Of course, readers should bear in mind that, as we live in a far from ideal world, it is usually necessary to revisit some steps in the process several times over.

Step 1: Sort out the rights

There is absolutely no point in trying to proceed with any kind of electronic text creation project if you have not obtained appropriate permissions from all those who hold any form of rights in the material with which you are hoping to work. This can be a tedious and time-consuming process, but time spent now can save unpleasant and potentially costly legal wrangles later on. Many archives and libraries will be happy for you to use their material (e.g. in the case of manuscript sources) provided that they are given appropriate attribution, and perhaps some small recompense if you intend to create a saleable resource. If you are working from photographs, facsimiles, or microfilm, then the creators and publishers of these items will also have rights which need to be considered. Similarly, if you are working from printed sources, you will need to ensure that nothing you are doing will infringe any of the rights held by the publishers and/or editors (although you may be able to negotiate the necessary permissions if you have a clear idea of how the material will be used).
Even if you are working from an electronic text which you obtained at no cost (e.g. via the web), you should still clarify the rights situation concerning your source material. Obtain all permissions in writing rather than relying upon verbal assurances or standard disclaimers, and never assume that people will not bother to sue you. If in doubt, take professional legal advice; it is also worth investigating whether or not your institution already has a dedicated Copyright Officer, or retains specialist legal staff who may be able to offer you some assistance.

Step 2: Assess your material

Refer to the chapters on Document Analysis and Digitization to establish the best way to capture and represent your source material. At some level this will almost certainly necessitate a degree of compromise between what you would like to do and what you are able to do with the knowledge and resources currently available to you. However, it is important to consider the implications of any decisions taken now, and to ensure that as far as possible you facilitate the future reuse of your material.

Step 3: Clarify your objectives

This relates to Step 2. The better your sense of how you would like to use your electronic text (and/or how you envisage others using it), the easier it will be to establish how you should set about creating it. There is little point in creating lavish high-quality digital images, or richly encoded transcriptions, if all you wish to do is construct a basic concordance or perform simple computer-assisted linguistic analyses. However, if you are aiming to produce a flexible electronic edition of your source text (one which will support many kinds of scholarly needs), or simply wish to offer users a digital surrogate for the original item, then such an investment may be worthwhile. You may find it easier to obtain financial support for your efforts if you can demonstrate that the deliverables will be amenable to multiple uses.

Step 4: Identify the resources available to you and any relevant standards

There are few substitutes for good local advice and support, so consult widely at your host institution as well as contacting bodies like the AHDS (http://ahds.ac.uk). Remember that for straightforward tasks such as scanning, OCR, or copy-typing, it may be more cost-effective to employ graduate student labour on an hourly basis than to sub-contract the work to a commercial service or employ a Research Assistant. Technical skills date rapidly, and it is rarely worth acquiring them yourself unless they will become central to your work and you are prepared to update them regularly. Whenever possible, you should aim to use open or de facto standards, as this is the best way to increase the chances that your digital resource(s) will remain viable in the long term.

Step 5: Develop a project plan

Any electronic text creation project is at the mercy of the technology involved, so careful planning is the key to minimising hold-ups. Consider scheduling a piloting and testing phase to help you resolve most of the procedural and technical problems. You should also build in a mechanism for ongoing quality control and checking, as mistakes in digital data can be very expensive to correct retrospectively. You should document all the key decisions and actions at every stage of the project, and ensure that any metadata records are kept up to date and complete.

Step 6: Do the work!

If you have prepared well and carried out each of the previous steps, then this should be the most straightforward phase of the entire project.
Step 7: Check the results

If you have been conducting quality control checks throughout the data creation process, then this step should reveal few surprises. However, if absolute fidelity to the original source is of fundamental importance to your work, it may be worthwhile investing in a separate programme of proof-reading. Simple checks to ensure that you have captured all your original sources, and that your data have been prepared and organised as you intended, can identify potentially costly mistakes which are easy to overlook. For example, if you are creating a series of digital images for a facsimile edition of a printed work, ensure that the sequencing of the images matches the pagination of the original analogue source. Similarly, if you are conducting a computer-assisted analysis of a transcribed text, the omission of a small but vital section could affect the validity of any results.

Step 8: Test your text

Whether your aim was to produce a data source for secondary analysis, an electronic edition for use by others, or something else entirely, you will need to ensure that what you have produced is actually fit for its intended purpose. You may find that by sharing your work with others, you will gain valuable advice and guidance on how the resource could be improved or developed to meet the needs of fellow researchers, teachers, and learners. Such sharing can be a frustrating process, especially if other people fail to appreciate why you undertook the work in the first place, but such feedback can often dramatically improve the quality and (re-)usability of a resource for relatively little extra effort.

Step 9: Prepare for preservation, maintenance, and updating

[Ideally, you should have prepared for this step as part of developing your project plan (Step 5).] If you have adopted open or de facto standards, then the preservation and maintenance of your data should present few surprises. If you are depositing your data with another agency (such as the AHDS), or with another part of your institution (e.g. library services), then by following good practice in data creation and documentation you will have created an electronic resource with excellent prospects for long-term viability. Updating your data and/or the resulting resource raises several different issues: from technical matters of version control, and how best to indicate to other users that the data or resource may have changed since they last used it, to possible sources of continuation funding.

Step 10: Review and share what you have learned

This can be an extremely valuable exercise, which can inform not only your own work and any future funding bids that you might make, but also those of colleagues working in the same (or comparable) discipline areas. There are several ways to disseminate information about your experiences, with a number of humanities computing journals, conferences, and agencies (such as the AHDS and JISC) being keen to ensure that lessons learned from practical experience are shared throughout the community.

Bibliography

ADOBE SYSTEMS. Adobe PostScript Overview [online]. Available from: http://www.adobe.com/print/postscript/main.html [Accessed 12 Nov 1999].
ADOBE SYSTEMS. Adobe PostScript Licensees and Development Partners [online]. Available from: http://www.adobe.com/print/postscript/oemlist.html [Accessed 12 Nov 1999].
AMERICAN LIBRARY ASSOCIATION (ALA), 1998. Committee on Cataloging: Description and Access Task Force on Metadata and the Cataloging Rules Final Report [online]. Available from: http://www.ala.org/alcts/organization/ccs/ccda/tf-tei2.html [Accessed 12 Nov 1999].
APEX DATA SERVICES, INC. Data Conversion Services [online]. Available from: http://www.apexinc.com/dcs/dcs_index.html [Accessed 12 Nov 1999].
BURNARD, L.D., AND SPERBERG-MCQUEEN, C.M., 1995. TEI Lite: An Introduction to Text Encoding for Interchange (TEI U5) [online]. Available from: http://www.hcu.ox.ac.uk/TEI/Lite/ [Accessed 12 Nov 1999].
CAERE OMNIPAGE. OmniPage Pro 10: Product Factsheet [online]. Available from: http://www.caere.com/products/omnipage/pro/factsheet.asp [Accessed 12 Nov 1999].
COVER, R. The SGML/XML Web Page [online]. Available from: http://www.oasis-open.org/cover/ [Accessed 12 Nov 1999].
DAY, M., 1997. Extending Metadata for Digital Preservation. Ariadne [online], 9. Available from: http://www.ariadne.ac.uk/issue9/metadata/ [Accessed 12 Nov 1999].
GASKELL, P., 1995. A New Introduction to Bibliography. Delaware: Oak Knoll Press.
GOLDFARB, C.F., 1990. The SGML Handbook. Oxford: Oxford University Press.
GROVES, P.J., AND LEE, S.D., 1999. 'On-Line Tutorials and Digital Archives' or 'Digitising Wilfred' [online]. Available from: http://info.ox.ac.uk/jtap/reports/index.html [Accessed 12 Nov 1999].
HEERY, R., POWELL, A., AND DAY, M., 1997. Metadata. Library and Information Briefings, 75, 1-19.
HEWLETT-PACKARD. Choosing a Scanner [online]. Available from: http://www.scanjet.hp.com/shopping/list.htm [Accessed 12 Nov 1999].
LEE, S.D., 1999. Scoping the Future of Oxford's Digital Collections [online]. Available from: http://www.bodley.ox.ac.uk/scoping/ [Accessed 12 Nov 1999].
LIBRARY OF CONGRESS. American Memory Project and National Digital Library Program [online]. Available from: http://lcweb2.loc.gov/ [Accessed 11 Nov 1999].
LYNCH, C., 1997. Searching the Internet. Scientific American [online]. Available from: http://www.sciam.com/0397issue/0397lynch.html [Accessed 12 Nov 1999].
MILLER, P., 1996. Metadata for the Masses. Ariadne [online], 5. Available from: http://www.ariadne.ac.uk/issue5/metadata-masses/ [Accessed 12 Nov 1999].
MILLER, P., AND GREENSTEIN, D., 1997. Discovering Online Resources Across the Humanities: A Practical Implementation of the Dublin Core [online]. Bath: UKOLN. Available from: http://ahds.ac.uk/public/metadata/discovery.html [Accessed 12 Nov 1999].
OCLC. Cataloging Internet Resources: A Manual and Practical Guide (Second Edition) (N.B. Olson, ed.) [online]. Available from: http://www.purl.org/oclc/cataloging-internet [Accessed 12 Nov 1999].
OCLC. Dublin Core Metadata Initiative [online]. Available from: http://purl.oclc.org/dc/ [Accessed 12 Nov 1999].
PEPPER, S. The Whirlwind Guide to SGML & XML Tools and Vendors [online]. Available from: http://www.infotek.no/sgmltool/guide.htm [Accessed 12 Nov 1999].
ROBINSON, P., 1993. The Digitization of Primary Textual Sources. Oxford: Office for Humanities Communication Publications.
SEAMAN, D., 1994. Campus Publishing in Standardized Electronic Formats: HTML and TEI [online]. Available from: http://etext.lib.virginia.edu/articles/arl/dms-arl94.html [Accessed 12 Nov 1999].
SHILLINGSBURG, P.L., 1996. Scholarly Editing in the Computer Age: Theory and Practice. 3rd ed. Ann Arbor: University of Michigan Press.
SPERBERG-MCQUEEN, C.M., AND BURNARD, L.D. (eds), 1994 (revised 1999). Guidelines for Electronic Text Encoding and Interchange [online]. Available from: http://www.hcu.ox.ac.uk/TEI/P4beta/ [Accessed 12 Nov 1999].
SPERBERG-MCQUEEN, C.M., AND BURNARD, L.D., 1995a. The Design of the TEI Encoding Scheme. Computers and the Humanities, 29, 17-39.
TANSELLE, G.T., 1981. Recent Editorial Discussion and the Central Questions of Editing. Studies in Bibliography, 34, 23-65.
TEXT ENCODING INITIATIVE (TEI), 1987. The Poughkeepsie Principles: The Preparation of Text Encoding Guidelines [online]. Available from: http://www-tei.uic.edu/orgs/tei/info/pcp1.html [Accessed 12 Nov 1999].
TEXT ENCODING INITIATIVE (TEI). The Pizza Chef: A TEI Tag Set Selector [online]. Available from: http://www.hcu.ox.ac.uk/TEI/newpizza.html [Accessed 12 Nov 1999].
TEXT ENCODING INITIATIVE (TEI). The TEI Consortium Homepage [online]. Available from: http://www.tei-c.org/ [Accessed 12 Nov 1999].
UKOLN. Metadata [online]. Available from: http://www.ukoln.ac.uk/metadata/ [Accessed 12 Nov 1999].
UNIVERSITY OF VIRGINIA EARLY AMERICAN FICTION PROJECT [online]. Available from: http://etext.lib.virginia.edu/eaf/intro.html [Accessed 12 Nov 1999].
UNIVERSITY OF VIRGINIA ELECTRONIC TEXT CENTER. Archival Digital Image Creation [online]. Available from: http://etext.lib.virginia.edu/helpsheets/specscan.html [Accessed 12 Nov 1999].
UNIVERSITY OF VIRGINIA ELECTRONIC TEXT CENTER. Image Scanning: A Basic Helpsheet [online]. Available from: http://etext.lib.virginia.edu/helpsheets/scanimage.html [Accessed 12 Nov 1999].
W3C. HyperText Markup Language Home Page [online]. Available from: http://www.w3.org/MarkUp/ [Accessed 12 Nov 1999].
W3C. Extensible Markup Language (XML) 1.0 [online]. Available from: http://www.w3.org/TR/REC-xml [Accessed 12 Nov 1999].
W3C. Extensible Stylesheet Language (XSL) Specification [online]. Available from: http://www.w3.org/TR/WD-xsl/ [Accessed 12 Nov 1999].
W3C. XHTML 1.0: The Extensible HyperText Markup Language [online]. Available from: http://www.w3.org/TR/xhtml1 [Accessed 12 Nov 1999].
W3C. XML Schema [online]. Available from: http://www.w3.org/TR/xmlschema-1/ [Accessed 12 Nov 1999].
WILFRED OWEN ARCHIVE [online]. Available from: http://www.hcu.ox.ac.uk/jtap/ [Accessed 12 Nov 1999].
YALE UNIVERSITY LIBRARY PROJECT OPEN BOOK [online]. Available from: http://www.library.yale.edu/preservation/pobweb.htm [Accessed 12 Nov 1999].

Glossary

AACR2
Anglo-American Cataloguing Rules (2nd ed., 1988 Revision). Rules used in the USA and UK which define the procedure for creating MARC records.

AHDS
The Arts and Humanities Data Service. Online: http://ahds.ac.uk/

ASCII
American Standard Code for Information Interchange, sometimes also referred to as 'plain text'. Essentially the basic character set, with minimal formatting (i.e. without changes in font, font size, use of italics, etc.).

Corpus (pl. Corpora)
Informally, a collection of data (e.g. whole texts or extracts, transcribed conversations, etc.) selected and organised according to certain principles. For example, a literary corpus might consist of all the prose works of a particular author, while a linguistic corpus might consist of all the forms of Russian verbs or examples of conversations amongst British English dialect speakers.

DESIRE
Development of a European Service for Information on Research and Education. Online: http://www.desire.org/

Digitize
The process by which a non-digital (i.e. analogue) source is rendered in machine-readable form. Most often used to describe the process of scanning a text or image using specialist hardware, to create machine-readable data which can be manipulated by another application (e.g. OCR or image processing software).
Document Analysis
The task of examining the source object (usually a non-electronic text) in order to acquire an understanding of the work being digitized and of what the purpose and future of the project entail. Document analysis is all about definition: defining the document context, defining the document type, and defining the different document features and relationships. Document analysis should usually be the first step in any electronic text creation project, and it requires users to become intimately acquainted with the format, structure, and content of their source material.

DTD
Document Type Definition. Rules, determined by an application, that apply SGML or XML to the markup of documents of a particular type.

Dublin Core
A metadata element set intended to facilitate discovery of electronic resources.

EAD
Encoded Archival Description Document Type Definition (EAD DTD). A non-proprietary encoding standard for machine-readable finding aids such as inventories, registers, indexes, and other documents created by archives, libraries, museums, and manuscript repositories to support the use of their holdings. Online: http://lcweb.loc.gov/ead/

GIF
Graphic Interchange Format. GIF files use an older format that is limited to 256 colours. Like TIFFs, GIFs use a lossless compression format, but without requiring as much storage space. While they do not have the compression capabilities of JPEGs, they are strong candidates for graphic art and line drawings. They can also be made into transparent GIFs, meaning that the background of the image can be rendered invisible, allowing it to blend in with the background of the web page.

HTML
HyperText Markup Language: a non-proprietary format (based upon SGML) for publishing hypertext on the World Wide Web. It has appeared in four main versions (1.0, 2.0, 3.2, and 4.0), although the World Wide Web Consortium (W3C) recommends using HTML 4.0. Online: http://www.w3.org/

JPEG
Joint Photographic Experts Group. JPEG files are the strongest format for web viewing, and for transfer through systems with space restrictions. JPEGs are popular with image creators not only for their compression capabilities but also for their quality. While TIFF uses lossless compression, JPEG is a lossy compression format: as the file size condenses, the image loses bits of information (the information least likely to be noticed by the eye). The disadvantage of this format is precisely what makes it so attractive: the lossy compression. Once an image is saved, the discarded information is lost, which means that the entire image, or certain parts of it, cannot be enlarged. The more work done to the image, requiring it to be re-saved, the more information is lost. As there is no way to retain all of the information scanned from the source, JPEGs are not recommended for archival storage. Nevertheless, in terms of viewing capabilities and storage size, JPEGs are the best image file format for online viewing.

MARC
MAchine Readable Cataloguing record. A bibliographic record used by libraries which can be processed by computers.
Markup (n.)
Text that is added to the data of a document in order to convey information about it. There are several kinds of markup, but the two most important are descriptive markup (often represented using markup tags such as <TITLE>, </H1>, etc.) and processing instructions (i.e. the internal instructions required to change the appearance of a piece of data displayed on screen, start a new page when printing, indicate a change in font, etc.).

Mark up (vb.)
To add markup.

Metadata
Data about data. The additional information used to describe something for a particular purpose (although that may not preclude its use for multiple purposes). For example, the 'Dublin Core' describes a set of metadata intended to facilitate the discovery of electronic resources (see http://purl.org/dc/).

OCR
Optical Character Recognition. OCR software attempts to recognise the characters on an image of a page of text, and to output a version of that text in machine-readable form. Modern OCR software can be trained to recognise different fonts, and may use a dictionary to facilitate recognition of certain characters and words. OCR works best with clean, modern, well-printed text.

OTA
The Oxford Text Archive. Online: http://ota.ahds.ac.uk/

PDF
Portable Document Format. The native proprietary file format of the Adobe Acrobat family of products, intended to enable users to exchange and view electronic documents easily and reliably, independent of the environment in which they were created. Online: http://www.adobe.com/

'Plain Text'
See ASCII.

PostScript
Adobe PostScript is a computer language that describes the appearance of a page, including elements such as text, graphics, and scanned images, to a printer or other output device. Online: http://www.adobe.com/print/postscript/main.html

RDF
The Resource Description Framework. A foundation for processing metadata; it provides interoperability between applications that exchange machine-understandable information on the web.

ROADS
A set of software tools to enable the set-up and maintenance of web-based subject gateways. Online: http://www.ilrt.bris.ac.uk/roads/

RTF
Rich Text Format. A proprietary file format developed by Microsoft that describes the format and style of a document (primarily for the purposes of interchange between different applications, most often common word-processors). Online: http://www.microsoft.com/

SGML
The Standard Generalized Markup Language. An International Standard (ISO 8879) defining a language for document representation that formalises markup and frees it of system and processing dependencies. SGML is the language used to create DTDs. Online: http://www.oasis-open.org/cover/

TEI
The Text Encoding Initiative is an international project which in May 1994 issued its Guidelines for the Encoding and Interchange of Machine-Readable Texts. These Guidelines provide SGML encoding conventions for describing the physical and logical structure of a large range of text types and features relevant for research in language technology, the humanities, and computational linguistics. A revised version of the Guidelines was released in 1999. Online: http://www.hcu.ox.ac.uk/TEI/P4beta/

TEI Lite
An SGML DTD which represents a simplified subset of the recommendations set out in the TEI's Guidelines for the Encoding and Interchange of Machine-Readable Texts. Online: http://www.hcu.ox.ac.uk/TEI/Lite/

TeX/LaTeX
A popular typesetting language (TeX) and a set of macro extensions (LaTeX), the latter being designed to facilitate descriptive markup. Online: http://www.tug.org/
TIFF
Tagged Image File Format. TIFF files are the most widely accepted format for archival image and master copy creation. TIFFs retain all of the scanned image data, allowing you to gather as much information as possible from the original. This is reflected in the one disadvantage of the TIFF image (the file size), but any type of compression is strongly advised against. Any project that plans to archive images, or to call them up for future modification, should scan using this format.

UKOLN
UK Office for Library and Information Networking. A national focus of expertise in network information management, based at the University of Bath. Online: http://www.ukoln.ac.uk/

Unicode
An industry profile of ISO 10646, the Unicode Worldwide Character Standard is a character coding system designed to support the interchange, processing, and display of the written texts of the diverse languages of the modern world. In addition, it supports classical and historical texts of many written languages. Online: http://www.unicode.org/unicode/consortium/consort.html

XML
The Extensible Markup Language is a data format for structured document interchange on the Web. The current World Wide Web Consortium (W3C) Recommendation is XML 1.0, February 1998. Online: http://www.w3.org/XML/