You are on page 1of 5

SGML is the substratum on which HTML was conceived and, therefore, is responsible for many of

HTML's strengths and weaknesses. SGML stands for Standard Generalized Markup Language;
this is a formal system designed for building text markup languages. It is not a markup system by
itself, however; think of it as a programming language to build working programs (HTML being
one of them) rather than a program by itself.

In this chapter, you'll learn the foundations of SGML to see how (and why) HTML was built on top
of it. It's very instructive and engaging to trace the roots of the language and explore the
conceptions of its creators. In fact, you can't say you know HTML unless you're at least sketchily
acquainted with its SGML heritage.

We'll analyze the definition of HTML in terms of SGML, consisting of SGML Declaration and
document type definition, both for HTML version 4.0. You'll see what valuable information can be
elicited from these formal constructs and how they can be used for authoritative reference on
HTML topics. You'll also learn why an HTML document should conform to a DTD and how to
ensure this using a validation service.

Knowledge of basic SGML concepts and syntax will provide you with a solid foundation for
mastering HTML and will help you understand some of the peculiarities of the language. The
goal of this chapter, however, is not to teach you SGML or how to write SGML applications, but to
show you how understanding SGML may aid in learning and applying HTML.

A Brief History of SGML

The roots of SGML go back to the late 1960s when the concept of descriptive markup saw the
light for the first time. After companies started using computers for document processing, it soon
became obvious that a storage format should contain not only formatting codes interpreted by
computer itself, but also descriptive human-legible information about the nature and role of every
element in a document.

The first working system that used these concepts was the Generalized Markup Language (GML)
developed by Charles Goldfarb, Edward Mosher, and Raymond Lorie at IBM. This system was
the direct predecessor of SGML and contained the prototypes for many of its major features, such
as hierarchical document structure and document type definitions. IBM adopted GML and built
mainframe-based publishing systems on it that were widely used to produce technical
documentation in the corporation.

In 1978, American National Standards Institute (ANSI) started research in the field of generic
document markup pursuing the goal of establishing a nationwide standard for information
interchange. Later, Dr. Goldfarb joined the ANSI working group, and in 1980 the first draft was
published, then finalized in 1983 as an industry standard named GCA 101-1983.

In 1984, the International Organization for Standardization (ISO) joined the activity and started
preparation of an international version of the standard. The first draft of the ISO standard was
published in 1985 and the final version appeared a year later. The standard bears the full name
"ISO 8879:1986 Information processing---Text and office systems---Standard Generalized Markup
Language (SGML)." The full text of the specification is available from ISO for a fee.

The best use of SGML is generally made in big corporations and agencies that produce a lot of
documents and can afford to introduce a single standard format for internal use. Besides IBM,
early applications of the new technology include the projects developed by Association of
American Publishers (AAP) and the U.S. Department of Defense. Finally, in 1992 researchers at
the European Laboratory for Particle Physics (CERN) chose to build a hypertext markup
language that they called HTML as an SGML application. But that's another story...

Procedural and Descriptive Markup

Any document can be thought of as consisting of content and markup. The content is relatively
straightforward; it bands together all the characters of text, images, and the rest of the meat that
delivers the message of your document. However, if you just add together all this stuff you won't
get a readable document. You need to mark it up---that is, to introduce some new information
into it that couldn't be automatically deduced from the content.

In fact, even a plain ASCII text, often cited as an example of pure text without any formatting,
contains a fair amount of markup. For instance, it is markup that allows you to determine what
the width of text column in a plain text file is, or where the boundaries between paragraphs lie. Of
course, the inventory of markup instructions (called tags) that you can apply is determined by the
format of your document and the tools you use to work with it.

So what is the purpose of markup? In other words, what information can, and should, be
conveyed by a document beyond its content? This extra information can be divided into two
groups: structure and formatting. Structure tells us how the document logically breaks down into
parts (paragraphs, sections, chapters) and how these parts are hierarchically organized, from the
document itself at the very top to the atoms of content at the very bottom. Formatting (in a broad
sense) governs presentation of the document: which fonts to use to display it, in what tone of
voice to read it aloud, how to break it into pages for printing, and so on.

Here enters the Great Markup Controversy that was one of the main driving forces behind the
creation of SGML and whose backwash is still disturbing the HTML community. Its essence,
without the fear of oversimplification, can be reduced to the question of which of the two markup
types is more important and should be given priority.

Obviously, many contemporary text processing systems are heavily biased towards marking up
formatting aspects of a document in the first place (this sort of markup is often called procedural
or presentational). The reason is obvious: What an average user needs most often is a formatted
document, not a structural diagram of its parts.

However, the bitter truth is that procedural markup may actually impede using and reusing the
document if not accompanied, and even preceded, by proper structural (often called descriptive
or generic) markup. For instance, if a document file contains instructions to set a line of text in
Times Roman, size 12 pt, and left-aligned, but never hints that this line is a heading or a figure
caption, then this markup is very restrictive.

What happens is you won't be able to change the formatting of all headings in a document at
once. You'll have difficulty exporting the document into another format or another medium (such
as a voice synthesis system), which may be using different means of formatting headings. You
cannot even automatically generate a table of contents. To put it simply, unless you know what
this part of text is supposed to be, information on how to render it is of very limited value.

On the other hand, after you have added information about the logical role of every element in the
text, the natural next step is to attach the presentational markup to these logical tags rather than
actual parts of the text. Now you don't have to specify font size for a heading in your text any
more; it is enough to mark it up as a heading, and the rest is taken care of automatically (provided
that someone has associated certain formatting parameters with the structural heading tag).

This concept, called "separation of presentation from content," is the major advantage of all
systems that put descriptive markup first. When separated, both content and formatting can be
developed by different people and modified much more easily. Thus, one of the roles of
descriptive markup is to serve as an intermediate layer separating content and formatting of a
document.

It should be admitted that text processing systems that are in common use now (such as office
word processors) do not completely ignore the benefits of descriptive markup. The named styles
that you apply to paragraphs in, say, Microsoft Word, represent some sort of descriptive markup
units with certain formatting tags associated with each of them. Moreover, users can create new
styles as needed for their documents. This provides for a certain level of separation of
presentation from content.

However, such a solution is only partial because style tags do not impose any restrictions on the
structure of your document. For example, it's no problem to assign a heading style to a
paragraph inside a figure caption or a footnote---which is pretty much senseless. Also, you are in
no way discouraged from making direct changes to text attributes, such as font face or size,
thereby overriding their values in a style. Styles in word processors are mere containers for
presentation attributes and not a means to impose some prescribed structure on a document
being created or processed.

You might wonder, do we really need to impose any structure on the document contents? Yes,
and here's why: You can't predict what uses will be made of your document tomorrow or in a year,
what formats it'll need to be converted to, or what media it'll be put onto. By using a strictly
defined set of hierarchical descriptive tags, you ensure that the text can be processed
automatically without any need to manually disambiguate cases such as a heading inside a
footnote. I could say that descriptive markup reveals the immaterial soul of a document so that
any program or person can then conveniently incarnate it into a body of choice.

The provision for automatic processing is the advantage that outshines all others. It is difficult to
imagine how many resources humankind spends annually on preparation, processing, and
interchange of documents. Office computing and desktop publishing software made this work
easier, but, in many cases, proprietary and presentation-oriented tools put more handicaps in the
flow of documentation than they deliver benefits. An open and extensible system of descriptive
markup would thus be invaluable in many situations.

To summarize, what we need is a markup system focusing on structure of a document rather than
its formatting. It should allow us to build a hierarchy of descriptive tags so that they could serve
not only to separate and describe different parts of a document but also to formally prescribe its
structure.

An equally important requirement is that the system should provide for easy extension and
modification. Ideally, a user should be able to define a completely new set of tags if such a need
arises. Finally, this system should not be proprietary; it is important that anyone be free to create
and use markup tools based on this system and to produce software implementing these tools.

SGML is the system designed to satisfy all of these requirements, as well as many others. SGML
is strictly descriptive and contains no means to mark up presentational aspects of documents.
However, SGML can be easily interfaced to external procedural markup systems and style
sheets.

It is the customizability area where SGML reveals its real power. In fact, SGML is not a markup
system by itself; it is, rather, a metasystem enabling users to create such systems for particular
types of documents. Its flexible syntax makes it possible to build markup languages (HTML being
one of the examples) to match any imaginable demand. Moreover, any single SGML document
can be provided with its own "local" markup definitions fine-tuned for the particular purpose.

Just like HTML, SGML is a computer language rather than a data format. This means that you
can create SGML files manually in a text editor, although there exist software tools that facilitate
the task. A piece of software that reads and analyzes an SGML document (for example, for
transformation or validation) is called an SGML parser. A parser by itself, however, is not very
useful because of the purely descriptive nature of SGML, so most often a parser is a part of a
bigger document processing or browsing application.

How to Define an SGML Application

From SGML's point of view, a document is a hierarchical structure of nested elements (chapters,
sections, paragraphs, and so on). SGML has no means---and was not intended to have---for
specifying any presentational aspects of these elements. However, strictly speaking, SGML
cannot tell you about the meaning or role of any element, either. This information is implied by
the creator of an SGML application and is usually provided in comments or in the documentation
accompanying the formal specification.

SGML realizes the maxim of Wittgenstein, who said, "The meaning of a word is its use." In
SGML, the only information that can be formally communicated about an element is in what
contexts and levels of document hierarchy it can or must occur. This means that you cannot build
an interpreter that could apply a meaningful formatting to a document based only on its SGML
markup. However, the purely formal dissection that SGML performs on a document is still
surprisingly useful in many situations.

All documents that can be marked up with the same hierarchy of elements are said to belong to a
certain document type. Rather than describe a set of tools to mark up documents, SGML defines
the structure of a particular type of documents via what is called document type definition (DTD).
A part of this chapter is devoted to analyzing the DTD for one particular SGML application, HTML
version 4.0 (code-named Cougar). Besides (and before) the DTD, some general features of an
SGML application are specified in another formal construct called the SGML declaration, which is
detailed in the next section.

As for SGML syntax, suffice it to say beforehand that it is pretty close to the syntax of HTML. You
will see that SGML statements, like HTML tags, are enclosed in angle brackets (<>) and contain a
keyword or name followed by one or more parameters separated by spaces. The only consistent
difference is that SGML statements commonly have the ! character inserted between the open
delimiter < and the statement keyword, for example:
<!ELEMENT IMG - O EMPTY -- Embedded image -->

You must already be familiar with one type of statement that uses the <! syntax, namely
comments in HTML documents that are enclosed in <!-- and -->. That's because the comment
syntax of HTML is directly borrowed from SGML, where everything within a <! statement enclosed
in double hyphens (--) is ignored by the SGML parser. For example, the words Embedded image
in the preceding code line are intended as a comment for human readers only.

One more <!-type declaration that needs to appear in HTML files is DOCTYPE, discussed briefly
later in this chapter.
SGML Declaration for HTML 4.0

GML declaration is a formal construct used to specify some general information about an SGML
application and its associated document type. The following sections list and analyze the SGML
declaration for HTML 4.0 provided by W3C.

The SGML declaration is contained in the SGML statement, which has the following syntax:
<!SGML "ISO 8879:1986" ... >

The ellipsis here represents the body of SGML declaration, and the string ISO 8879:1986 is
meant to denote the level of SGML standard that this declaration conforms to. In our case, this is
the original ISO specification published in 1986.

In the body of the declaration, first comes the comment part:

--
SGML Declaration for HyperText Markup Language
version 4.0

With support for Unicode UCS-2 and increased limits


for tag and literal lengths etc.
--

The rest of the declaration body is divided into sections that are described next.

You might also like