
Data Communication I

Lecture 8

presented by
Werner Wild, CEO Evolution Innsbruck, San Francisco, Zurich

Contact: info@evolution.at

Today's Lecture
XML overview (XML: eXtensible Markup Language)
Building documents and DTDs (DTD: Document Type Definition)

Links
http://www.w3.org/XML/
http://xml.apache.org/xerces-j/index.html

XML, What Is It?

Basically a markup language (ML)


Text with "annotations" (markup)
Strongly structured representation of arbitrary data
Binary data may be included after encoding (e.g. Base64)
"ASCII of the Web"

Family of technologies
XML, XLink, XPath, XPointer, XSL, XSLT, ...

History
Forerunner: SGML (Standard Generalized Markup Language)
SGML is very powerful and complicated
XML is a subset of SGML
XML 1.0 standard released 02/1998 (W3C Recommendation)

XML Benefits
<?xml version="1.0"?>
<mail-addressbook>
  <mail-address category="professor">
    <name>Werner Wild</name>
    <email>info@evolution.at</email>
  </mail-address>
  <mail-address category="assistant">
    <name>Landon Bradshaw</name>
    <email>landon@bradshaw.org</email>
  </mail-address>
</mail-addressbook>

Readable: meaningful tags support understanding
Open: platform-independent data representation
Structured: easy to parse and process
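To illustrate "easy to parse and process", here is a minimal sketch that reads the address book with Python's standard `xml.etree.ElementTree` module. The function name `read_entries` and the constant `ADDRESS_BOOK` are made up for this example; any XML parser would do.

```python
import xml.etree.ElementTree as ET

ADDRESS_BOOK = """<?xml version="1.0"?>
<mail-addressbook>
  <mail-address category="professor">
    <name>Werner Wild</name>
    <email>info@evolution.at</email>
  </mail-address>
  <mail-address category="assistant">
    <name>Landon Bradshaw</name>
    <email>landon@bradshaw.org</email>
  </mail-address>
</mail-addressbook>"""


def read_entries(xml_text):
    """Return one (category, name, email) tuple per <mail-address>."""
    root = ET.fromstring(xml_text)
    entries = []
    for address in root.findall("mail-address"):
        entries.append((
            address.get("category"),    # attribute value (a string)
            address.findtext("name"),   # text of the first <name> child
            address.findtext("email"),  # text of the first <email> child
        ))
    return entries


entries = read_entries(ADDRESS_BOOK)
```

The meaningful tag names carry over directly into the code: no column positions or separators have to be known in advance.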

XML Syntax

Usable characters in XML documents: almost all Unicode

Tab (0x09), line feed (0x0A), carriage return (0x0D)
ASCII display characters (0x20-0x7E)
Unicode (0x80-0xD7FF), Private Use Area (0xE000-0xF8FF)
Chinese-Japanese-Korean characters (0xF900-0xFFFD)

XML name format (case-sensitive!)


Must begin with a letter, underscore (_) or colon (:)
Valid name characters: letters, _, :, digits, -, .
Colon should not be used except as namespace delimiter
Must not begin with xml (reserved)

The following are legal names (though perhaps not sensible ones)


_, Aqene, zork
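The naming rules can be sketched as a small check. This is an ASCII-only approximation (real XML names also allow many non-ASCII letters), and the names `NAME_RE` and `is_legal_name` are invented for the example.

```python
import re

# First character: letter, underscore or colon.
# Following characters: letters, digits, underscore, colon, hyphen, period.
# ASCII-only simplification of the rules on the slide.
NAME_RE = re.compile(r"[A-Za-z_:][A-Za-z0-9_:.\-]*")


def is_legal_name(name):
    """Return True if name follows the (simplified) XML naming rules."""
    if NAME_RE.fullmatch(name) is None:
        return False
    # Names starting with "xml" (in any letter case) are reserved.
    return not name.lower().startswith("xml")
```

For instance, `is_legal_name("mail-address")` holds, while `is_legal_name("1st")` and `is_legal_name("xmlData")` do not.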

XML Elements

What elements are


Basic building blocks of XML documents
May contain other elements, character data, entities, ...
All XML data (few exceptions) must be contained in elements

Form of elements
Enclosed between a start tag (<...>) and an end tag (</...>)
Start and end tag must have the same name
End tags may not be left out as in HTML
Special form for tags that don't enclose content (empty tags)

HTML: <br> (empty tag, no corresponding end tag)
XML: <br/> (empty tag)

XML Elements

XML documents have strict tree structure


Tags must be nested correctly

<doc>
  <tag>content</tag>
  <nop/>
  <other>more news</other>
</doc>

[Figure: tree representation with root <doc> and children <tag>, <nop>, <other>]
[Figure: alternative tree representation (emphasizes node ordering)]

<doc>
  <tag>content
  <other>more
  </tag>news
  </other>
</doc>

This is illegal!
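A parser rejects overlapping tags outright. The following sketch, assuming Python's standard `xml.etree.ElementTree`, feeds it both documents; `is_well_formed`, `NESTED` and `OVERLAPPING` are names chosen for this example.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the parser accepts xml_text, False on a syntax error."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


# Correctly nested: every element closes before its parent does.
NESTED = "<doc><tag>content</tag><nop/><other>more news</other></doc>"

# Overlapping: </tag> arrives while <other> is still open.
OVERLAPPING = "<doc><tag>content<other>more</tag>news</other></doc>"
```

The same check also shows the difference to HTML: `is_well_formed("<br/>")` succeeds, whereas a lone `<br>` without an end tag fails.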

XML Elements

String literals
Either "..." or '...'
Don't mix delimiters!

Attributes
If tags are nouns, attributes are adjectives

<mail-address category="professor">

All attributes are strings!

XML Elements

Character references
Represent a displayable character that can't be used literally
Form:
&#NNNNN; decimal Unicode number
&#xXXXX; hexadecimal Unicode number

Copyright sign: &#169; or &#xA9;

Entity references
Form: &name;, where name is a legal XML name
Predefined entities:
&amp;: & character
&lt;, &gt;: < and > characters
&apos;, &quot;: ' and " characters

Entities may be defined in a DTD
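A short sketch of how a parser resolves character references and predefined entities, and how text can be escaped in the other direction. It assumes Python's standard library (`xml.etree.ElementTree` and `xml.sax.saxutils.escape`); the variable names are illustrative only.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Parsing: references are replaced by the characters they stand for.
element = ET.fromstring("<t>&#169; &#xA9; &lt;b&gt; &amp; &apos;&quot;</t>")
decoded = element.text

# Writing: escape() replaces &, < and > by default;
# further replacements can be passed as a dictionary.
encoded = escape("a < b & c > d")
encoded_quotes = escape('say "hi"', {'"': "&quot;"})
```

Both `&#169;` and `&#xA9;` decode to the same copyright sign, since they are the decimal and hexadecimal forms of one code point.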

XML Elements

Processing instructions
Form: <?target instruction?>
Information for XML file processors
target is required, instruction is just some string
Usage depends on processors

Comments
Form: <!--...-->
May contain any string except --
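Processing instructions and comments show up as their own node types in a DOM tree. This sketch assumes Python's standard `xml.dom.minidom`; the target name `target`, the sample texts and the helper `parses` are made up for illustration.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

doc = minidom.parseString("<doc><?target some instruction?><!-- a note --></doc>")
pi, comment = doc.documentElement.childNodes

# A processing instruction is split into its target and the remaining string.
pi_info = (pi.nodeType == pi.PROCESSING_INSTRUCTION_NODE, pi.target, pi.data)

# A comment keeps its text exactly as written, including the blanks.
comment_info = (comment.nodeType == comment.COMMENT_NODE, comment.data)


def parses(xml_text):
    """Return True if minidom accepts xml_text."""
    try:
        minidom.parseString(xml_text)
    except ExpatError:
        return False
    return True
```

A comment containing a double hyphen, such as `<!-- a -- b -->`, makes `parses` return False.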

XML Elements

CDATA sections
For including text containing markup characters
Form: <![CDATA[...]]>
CDATA sections cannot be nested!

Usage example
<doc>
  <explanation>Next, we will see some XML code.</explanation>
  <![CDATA[
    <?xml version="1.0"?>
    <some-xml>
      This is some XML code. Use with care!
    </some-xml>
  ]]>
  <explanation>That was it.</explanation>
</doc>
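The effect of a CDATA section can be observed with a parser: the markup inside arrives as plain text, not as elements. A minimal sketch with Python's standard `xml.etree.ElementTree` follows; the shortened document `DOC` and the variable names are assumptions of this example.

```python
import xml.etree.ElementTree as ET

DOC = """<doc>
<explanation>Next, we will see some XML code.</explanation>
<![CDATA[<some-xml>This is some XML code. Use with care!</some-xml>]]>
<explanation>That was it.</explanation>
</doc>"""

root = ET.fromstring(DOC)

# Only the two <explanation> elements are children of <doc>;
# <some-xml> was never parsed as an element.
child_tags = [child.tag for child in root]

# ElementTree keeps text that follows an element in that element's .tail.
literal_text = root[0].tail.strip()
```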

XML Document Structure

Prolog (optional)
Signal the beginning of XML data, describe character encoding
Provide additional information for parser/application
DTD (Document Type Definition, document "grammar")
Processing instructions

Body
Actual data

Epilog (optional)
"A real design error" (Tim Bray, W3C Rec. co-author)
Not covered

Sample XML Document

prolog with DTD

<?xml version="1.0"?>
<!DOCTYPE mail-addressbook [
  <!ELEMENT mail-addressbook (mail-address*)>
  <!ELEMENT mail-address (name, email+)>
  <!ATTLIST mail-address category (professor | assistant) #IMPLIED>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT email (#PCDATA)>
]>
<mail-addressbook>
  <mail-address category="professor">
    <name>Werner Wild</name>
    <email>info@evolution.at</email>
  </mail-address>
  <mail-address category="assistant">
    <name>Landon Bradshaw</name>
    <email>landon@bradshaw.org</email>
  </mail-address>
</mail-addressbook>

Sample XML Document


document body: the <mail-addressbook> element and everything inside it, i.e. the part of the sample document above that follows the closing ]> of the DOCTYPE

XML Prolog: Declaration

XML declaration
Form:
<?xml version="1.0" encoding="..." standalone="..."?>

First entry in an XML document (no preceding white space!)


Parameter version: compulsory


Currently, the only possible and allowed value is "1.0"

Parameter encoding: optional


Specify character encoding used throughout the document
UTF-8, UTF-16, ISO-8859-1, ...

Parameter standalone: optional


yes: all needed entity declarations contained in document
no: external DTD needed
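The three parameters of the XML declaration can be read back after parsing. This sketch assumes Python's standard `xml.dom.minidom`, which exposes them as attributes of the document object; the sample bytes `RAW` and the helper `parses` are invented for the example.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# Bytes, not text: the encoding parameter tells the parser how to decode them.
# 0xFC is the letter u-umlaut in ISO-8859-1.
RAW = b'<?xml version="1.0" encoding="ISO-8859-1" standalone="yes"?><city>Z\xfcrich</city>'

doc = minidom.parseString(RAW)
declaration = (doc.version, doc.encoding, doc.standalone)
city = doc.documentElement.firstChild.data


def parses(raw):
    """Return True if minidom accepts the raw document."""
    try:
        minidom.parseString(raw)
    except ExpatError:
        return False
    return True
```

Putting even a single blank in front of the declaration, as in `parses(b" " + RAW)`, makes the parser reject the document.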

XML Prolog: Document Type Declaration


Note: this is not the DTD (Document Type Definition)


Document type declaration is a reference to an external DTD (it may also carry declarations inline between [ and ], as in the sample document)

Declaration forms
<!DOCTYPE root SYSTEM "sys-id">
<!DOCTYPE root PUBLIC "pub-id" "sys-id">

Parameters
root is the name of the document's root node
sys-id is a URI pointing to the file containing the DTD
URIs may also be JNDI or LDAP names etc.
pub-id provides further information
Use PUBLIC if widespread use is intended

XML Prolog: Document Type Declaration

DTDs can be located anywhere (URI sys-id)


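Local file:
<!DOCTYPE mail-addressbook SYSTEM "mail-addressbook.dtd">

Public identifier plus DTD URI:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">

W3C: who did it
DTD XHTML 1.0 Strict: what is it
http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd: DTD URI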
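The parts of a document type declaration can be inspected after parsing. The sketch below assumes Python's standard `xml.dom.minidom`, which to my knowledge records the identifiers without fetching the external DTD; `doctype_parts`, `XHTML` and `LOCAL` are names chosen for this example.

```python
from xml.dom import minidom

XHTML = (
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" '
    '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">'
    "<html/>"
)
LOCAL = '<!DOCTYPE mail-addressbook SYSTEM "mail-addressbook.dtd"><mail-addressbook/>'


def doctype_parts(xml_text):
    """Return (root name, public identifier, system identifier)."""
    doctype = minidom.parseString(xml_text).doctype
    return doctype.name, doctype.publicId, doctype.systemId
```

For the SYSTEM form there is no public identifier, so the second entry of the tuple is `None`.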

DTDs Overview

What do DTDs provide?


Vocabulary for XML documents
Formal structure
"Grammar" on top of XML syntax

Why vocabularies?
XML documents are snapshots of domain data structures
Used for communication between applications
Fixed vocabulary helps in development

Why formal structures?


Check documents against it
Ease error processing

XML Document Criteria

Well-formedness
Correct syntax (no overlapping tags, correct nesting, ...)
Only one root node
No references to external entities (unless a DTD is given)

Validity
Well-formedness, plus:
Document conforms to a given DTD

Writing DTDs

Sample DTD for an e-mail addressbook


<?xml version="1.0" encoding="ISO-8859-1"?>
<!ELEMENT mail-addressbook (mail-address*)>
<!ELEMENT mail-address (name, email+)>
<!ATTLIST mail-address category (professor | assistant) #IMPLIED>
<!ELEMENT name (#PCDATA)>
<!ELEMENT email (#PCDATA)>

<?xml ...?> line: yes, we need that
The document root is <mail-addressbook>, and it contains zero or more <mail-address> elements
A <mail-address> element consists of a <name> and then of one or more <email> elements
<mail-address> elements have an optional attribute named category which can have one of the two values "professor" and "assistant"
The <name> and <email> elements contain textual data

Writing DTDs

Basic markup declarations


ELEMENT: declaration of an XML element type (tag)
ATTLIST: declarations of attributes and possible values
ENTITY: entity declarations (reusable content)
NOTATION: format declarations for external data
  Data is not meant to be parsed
  Specifying the application that handles the data
  Use with care: may be platform-dependent

Entities and notations are supportive

Entities are mere shortcuts
Notations map "data types" to handlers

Defining Entities

General entities
Used within the contents of a document
Define: <!ENTITY copyright "&#xA9; 2002 Werner Wild">
Use: &copyright;

External parsed general entities

Replacement text can be found in an external file
Define: <!ENTITY ext SYSTEM "include.txt">

Self-references are not allowed


Imagine a recursive entity for GNU (GNU's not Unix)
This is wrong: <!ENTITY gnu "&gnu;&apos;s not Unix">
Remember the predefined entities!
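Both cases can be tried with a parser that reads the internal DTD subset. The sketch assumes Python's standard `xml.etree.ElementTree` (built on Expat), which expands internal general entities; the wrapper element `note` and the variable names are made up for the example.

```python
import xml.etree.ElementTree as ET

# A general entity declared in the internal subset and used in the body.
WITH_ENTITY = """<!DOCTYPE note [
  <!ENTITY copyright "&#xA9; 2002 Werner Wild">
]>
<note>&copyright;</note>"""

# The self-referencing entity from the slide.
RECURSIVE = """<!DOCTYPE note [
  <!ENTITY gnu "&gnu;&apos;s not Unix">
]>
<note>&gnu;</note>"""

expanded = ET.fromstring(WITH_ENTITY).text

try:
    ET.fromstring(RECURSIVE)
    recursive_accepted = True
except ET.ParseError:
    # The parser detects the self-reference when &gnu; is expanded.
    recursive_accepted = False
```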

Defining Entities

Parameter entities
Parsed entities used solely within the DTD
Help keep commonly used constructs in one place

Imagine several tags with the same attribute lists


(For the moment, don't bother about attribute lists...)
<!ATTLIST tag1 attr1 CDATA #REQUIRED
               attr2 CDATA #REQUIRED>
<!ATTLIST tag2 attr1 CDATA #REQUIRED
               attr2 CDATA #REQUIRED>
...and so forth...

The easy way: parameter entity


<!ENTITY % attrlist "attr1 CDATA #REQUIRED
                     attr2 CDATA #REQUIRED">
<!ATTLIST tag1 %attrlist;>
<!ATTLIST tag2 %attrlist;>
...and so forth...

Note the percent sign in the declaration; use the percent sign in the references (%attrlist;) as well
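Conceptually, a parameter entity is a text substitution inside the DTD. The sketch below imitates that substitution with regular expressions. It is a toy model, not what a real XML parser does internally, and `expand_parameter_entities`, `ENTITY_DECL` and `DTD_TEXT` are hypothetical names.

```python
import re

DTD_TEXT = """<!ENTITY % attrlist "attr1 CDATA #REQUIRED attr2 CDATA #REQUIRED">
<!ATTLIST tag1 %attrlist;>
<!ATTLIST tag2 %attrlist;>"""

# Matches <!ENTITY % name "replacement text"> and captures name and text.
ENTITY_DECL = re.compile(r'<!ENTITY\s+%\s+(\S+)\s+"([^"]*)"\s*>\s*')


def expand_parameter_entities(dtd_text):
    """Replace every %name; reference by its declared replacement text."""
    entities = dict(ENTITY_DECL.findall(dtd_text))
    without_declarations = ENTITY_DECL.sub("", dtd_text)
    return re.sub(r"%(\S+?);", lambda m: entities[m.group(1)], without_declarations)


expanded = expand_parameter_entities(DTD_TEXT)
```

After expansion, both ATTLIST declarations read exactly like the long-hand version, which is the point of keeping the attribute list in one place.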

Defining Notations

Bind non-XML data to applications


Examples (notation name, then external application):

<!NOTATION gif SYSTEM "/usr/bin/xv">
<!NOTATION jpg SYSTEM "jpegviewer.exe">

This may be platform-dependent!

Defining Elements

Easy: empty elements


Define: <!ELEMENT tagName EMPTY>
This yields a possible tag <tagName/>

As easy: elements with arbitrary content


Define: <!ELEMENT stuffMe ANY>
Anything may be placed between <stuffMe> and </stuffMe>
Use this with care

Defining Elements

Elements with defined content


Define: <!ELEMENT name contentmodel>

Example: a person's name


<!ELEMENT person-name (first, middle, last)>

The specified elements must all occur in the given order

This is ok:
<person-name>
  <first>William</first>
  <middle>Henry</middle>
  <last>Gates</last>
</person-name>

This is not ok: <middle> is missing
<person-name>
  <first>Steve</first>
  <last>Jobs</last>
</person-name>

This is not ok: wrong order
<person-name>
  <last>Gates</last>
  <first>William</first>
  <middle>Henry</middle>
</person-name>

Defining Elements

The problem with middle names


Someone may have none
Others may have n

Improved version (cardinality operator *: zero or more)

<!ELEMENT person-name (first, middle*, last)>

Cardinality operators

Option (may or may not appear, zero or one): ?
Zero or more: *
One or more: +

Defining Elements

Redefinition of the <person-name> element


Person may have arbitrarily many middle names or one nickname
Examples of possible names:

William Henry Gates
Carl Philipp Emanuel Bach
Douglas "42" Adams
Wolfgang Amadeus "Vielschreiber" Mozart

The following is not right for all of these names: the alternative ("or", written |) allows either middle names or one nickname, so a name with both (Mozart) is rejected

<!ELEMENT person-name (first, (middle* | nick), last)>

Someone with a nickname:
<person-name>
  <first>Douglas</first>
  <nick>42</nick>
  <last>Adams</last>
</person-name>

Defining Elements

Then again...
Someone may have loads of middle and nicknames grouped
Go for it:
<!ELEMENT person-name (first, (middle | nick)*, last)>

What about the other elements?


first, middle, nick and last contain text
Text inside elements is declared as #PCDATA (parsed character data)
There is no #CDATA content model in XML DTDs: CDATA is an attribute type, and text that must not be parsed goes into a CDATA section in the document
Markup in #PCDATA must be defined in the DTD
<!ELEMENT first (#PCDATA)>
<!ELEMENT middle (#PCDATA)>
...and so forth...
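Content models with sequences, alternatives and cardinality operators behave like regular expressions over the sequence of child elements. The following sketch uses that analogy to check the person-name examples against the four content models from the slides. It is a teaching aid under that assumption, not a DTD validator, and all names in it (`model_to_regex`, `matches`, `person`) are invented here.

```python
import re
import xml.etree.ElementTree as ET

NAME = r"[A-Za-z_:][A-Za-z0-9_:.\-]*"  # ASCII-only element names


def model_to_regex(content_model):
    """Translate an element-only content model into a regular expression.

    Each element name becomes the group (?:name;), the comma (sequence)
    becomes plain concatenation, and | * + ? keep their meaning.
    """
    compact = re.sub(r"\s", "", content_model)
    grouped = re.sub(NAME, lambda m: "(?:" + re.escape(m.group()) + ";)", compact)
    return re.compile(grouped.replace(",", ""))


def matches(content_model, xml_text):
    """Check the children of the root element against the content model."""
    root = ET.fromstring(xml_text)
    children = "".join(child.tag + ";" for child in root)
    return model_to_regex(content_model).fullmatch(children) is not None


def person(*tags):
    """Build a <person-name> element with empty children in the given order."""
    return "<person-name>" + "".join("<%s/>" % t for t in tags) + "</person-name>"


STRICT = "(first, middle, last)"
MANY_MIDDLE = "(first, middle*, last)"
MIDDLE_OR_NICK = "(first, (middle* | nick), last)"
MIXED = "(first, (middle | nick)*, last)"
```

For example, `matches(MIDDLE_OR_NICK, person("first", "nick", "last"))` holds, while a name with both a middle name and a nickname only passes under `MIXED`.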

Defining Attributes

Any XML element may have attributes (even empty ones)
Basic structure of an attribute declaration:

<!ATTLIST elementName attributeName CDATA #REQUIRED>

elementName: name of the element the attribute belongs to
attributeName: name of the attribute itself
CDATA: attribute type
#REQUIRED: default declaration

Default declarations

#REQUIRED: the attribute must appear
#IMPLIED: optional attribute
#FIXED "default": must have this value, can be left out
"default": if the attribute is not present, default is assumed
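Default declarations can be observed with a parser that reads the internal DTD subset. This sketch assumes that Python's standard `xml.etree.ElementTree` (via Expat) fills in declared defaults even though it does not validate; the element `item` and its attributes `unit`, `version` and `note` are hypothetical.

```python
import xml.etree.ElementTree as ET

DOC = """<!DOCTYPE list [
  <!ATTLIST item unit    CDATA "pieces">
  <!ATTLIST item version CDATA #FIXED "1.0">
  <!ATTLIST item note    CDATA #IMPLIED>
]>
<list><item/><item unit="kg"/></list>"""

# Unpack the two <item> children of <list>.
first, second = ET.fromstring(DOC)

# first: nothing written in the document, defaults come from the DTD.
# second: an explicitly written value replaces the default.
first_attributes = (first.get("unit"), first.get("version"), first.get("note"))
second_unit = second.get("unit")
```

The #IMPLIED attribute has no default, so it is simply absent when the document leaves it out.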

Defining Attributes: Types

CDATA
Attribute may contain simple text without markup
No elements allowed; < and & must be written as &lt; and &amp;

ID
Attributes of that type are intended to have unique values
Must be #REQUIRED or #IMPLIED
Value must conform to XML naming rules
IDs can be referenced

Defining Attributes: Types

IDREF, IDREFS
References to IDs
Use with IDs to model data relationships with unique keys
Values must conform to XML naming rules (so they cannot start with a digit)
IDREFS attributes contain space-separated lists of IDs

<!ELEMENT person (person-name, address, ...)>
<!ATTLIST person id ID #REQUIRED>
<!ELEMENT customer EMPTY>
<!ATTLIST customer id IDREF #REQUIRED>

<person id="p42230815">
  <person-name>...</person-name>
  <address>...</address>
  ...
</person>
...
<customer id="p42230815"/>

Incorrect IDREFs yield validation errors!
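What a validating parser checks for IDREF attributes can be approximated by hand. The sketch below collects all person IDs and reports customer references that point nowhere. It hard-codes the element and attribute names instead of reading them from a DTD, and the wrapper element `records`, the second customer and the function name are assumptions of this example.

```python
import xml.etree.ElementTree as ET

DOC = """<records>
  <person id="p42230815"/>
  <customer id="p42230815"/>
  <customer id="p99999999"/>
</records>"""


def dangling_references(xml_text):
    """Return the customer ids that do not match any person id."""
    root = ET.fromstring(xml_text)
    known_ids = {person.get("id") for person in root.iter("person")}
    return [
        customer.get("id")
        for customer in root.iter("customer")
        if customer.get("id") not in known_ids
    ]


dangling = dangling_references(DOC)
```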

Defining Attributes: Types

ENTITY, ENTITIES
Entities are constructions that appear several times
Can also be used as attribute values

Four steps to do that

Declare a notation
<!NOTATION gif SYSTEM "/usr/bin/xv">

Declare one or more entities for use with the attribute (unparsed entities, bound to the notation with NDATA)
<!ENTITY myPicture SYSTEM "pic.gif" NDATA gif>

Declare an attribute of type ENTITY
<!ATTLIST elem pic ENTITY #IMPLIED>

Create the external entity (of course)

Usage:
<elem pic="myPicture">...</elem>

Defining Attributes: Types

Enumerations
Attribute values may be restricted to several values
Values must be name tokens (name characters only; unlike names, they may start with a digit)

<!ATTLIST mail-address category (professor | assistant) #IMPLIED>
(professor | assistant): allowed values

NMTOKEN, NMTOKENS

Quite like enumerations, but none are predefined
Values are not part of the grammar
New ones can be added easily without modifying the DTD
Here too, values must be name tokens
<!ATTLIST mail-address category NMTOKEN #IMPLIED>
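The difference between the two declarations of category can be sketched as two checks: an enumeration accepts only the listed values, while NMTOKEN accepts any single token made of name characters. The pattern is an ASCII-only approximation, and the names `ALLOWED_CATEGORIES`, `valid_enumerated` and `valid_nmtoken` are made up for this example.

```python
import re

# Values listed in the enumerated declaration of category.
ALLOWED_CATEGORIES = {"professor", "assistant"}

# One or more name characters, no white space (ASCII-only approximation).
NMTOKEN_RE = re.compile(r"[A-Za-z0-9_:.\-]+")


def valid_enumerated(value):
    """Enumeration: the value must be one of the declared alternatives."""
    return value in ALLOWED_CATEGORIES


def valid_nmtoken(value):
    """NMTOKEN: any single name token is acceptable."""
    return NMTOKEN_RE.fullmatch(value) is not None
```

A new category such as "student" fails the enumeration check until the DTD is changed, but passes the NMTOKEN check right away.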

DTDs Discussion

DTD syntax
It's SGML, not XML
Shouldn't XML documents be described in XML?

DTDs are closed constructs


Reusing parts of other DTDs is not directly possible

Solution under way: XML Schema


Describe document grammar in XML
W3C Recommendation

Related Technologies

Namespaces
Help avoid name clashes
Improve vocabulary reuse

<?xml version="1.0"?>
<someRoot xmlns="someRoot.dtd"
          xmlns:otherSpace="otherSpace.dtd">
  <someElement>
    This belongs to someRoot!
  </someElement>
  <otherSpace:elem>
    This belongs to otherSpace!
  </otherSpace:elem>
  <someElement otherSpace:attr="imported attribute"/>
</someRoot>

xmlns="someRoot.dtd": default namespace
xmlns:otherSpace="otherSpace.dtd": some other namespace
someElement was defined in someRoot.dtd
elem was defined in otherSpace.dtd
otherSpace:attr on someElement: yes, this is possible
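A namespace-aware parser resolves prefixes to their namespace names, so two elements with the same local name can be told apart. This sketch assumes Python's standard `xml.etree.ElementTree`, which writes qualified names as `{namespace}local`; the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

DOC = """<someRoot xmlns="someRoot.dtd" xmlns:otherSpace="otherSpace.dtd">
  <someElement>This belongs to someRoot!</someElement>
  <otherSpace:elem>This belongs to otherSpace!</otherSpace:elem>
  <someElement otherSpace:attr="imported attribute"/>
</someRoot>"""

root = ET.fromstring(DOC)

# Unprefixed elements fall into the default namespace,
# prefixed ones into the namespace bound to the prefix.
tags = [child.tag for child in root]

# The attribute from the other namespace is looked up by its qualified name.
imported = root[2].get("{otherSpace.dtd}attr")
```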

Related Technologies

XLink
Linking to other resources from within an XML document
Roughly analogous to hyperlinks

XPath
General specification (W3C) for access to document parts
Defines an addressing mechanism

XPointer
Pointing to particular locations in or portions of documents
Wraps XPath: standard mechanism to use addresses

XSL, XSLT
Transforming XML documents
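To give a feel for XPath as an addressing mechanism, here is a sketch that selects parts of the address book by path. It assumes Python's standard `xml.etree.ElementTree`, which supports only a limited subset of XPath; the variable names are chosen for this example.

```python
import xml.etree.ElementTree as ET

ADDRESS_BOOK = """<mail-addressbook>
  <mail-address category="professor">
    <name>Werner Wild</name>
    <email>info@evolution.at</email>
  </mail-address>
  <mail-address category="assistant">
    <name>Landon Bradshaw</name>
    <email>landon@bradshaw.org</email>
  </mail-address>
</mail-addressbook>"""

root = ET.fromstring(ADDRESS_BOOK)

# Path with a predicate: names of all entries whose category is "professor".
professor_names = [
    name.text
    for name in root.findall("./mail-address[@category='professor']/name")
]

# Descendant search: every <email> element anywhere below the root.
all_emails = [email.text for email in root.findall(".//email")]
```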

Simply using XML to represent data does not make documents more expressive. They must also be well-structured.


The End

Thank you for your attention!

Sources

Many sources were used in the preparation of this lecture; my special thanks go to:
Univ. Helsinki
Univ. California San Diego (UCSD)
Univ. Darmstadt
many others
