WWW Information Pack
Topics covered: XML, XSLT transformations, validation, DTDs, schema, interoperability, IMS, SCORM, e-GIF, SMIL, MathML, RDF, semantic web, RSS, syndication, SOAP, web services
XML (Extensible Markup Language) is an extremely powerful new technology, which is radically changing the way in which information is exchanged on the Internet.
XML has a number of features in common with HTML: it is a markup language, i.e. it uses tags to describe information in a structured way, so that it can be stored and transported (principally via the web, using the HTTP protocol). However, XML is much more powerful and flexible than HTML, and it is suited to a different set of uses.
The original goal of HTML (as devised by Tim Berners-Lee in 1989) was to enable the construction of cross-linked "screens" of information, whose content is displayed as a single "web page" by a browser. As HTML was developed and extended, it was increasingly used to create visually appealing web page designs. However, the underlying HTML (and XHTML) language is limited to the standard set of tags defined by the W3C (such as <h1> <table> etc.), so the HTML tags by themselves do not give a very clear indication of the precise meaning of the information that they contain - they only mark it out as "a heading" or "a table".
XML is not intended as a replacement for HTML, and it is certainly not a design tool. The principal goal of XML is to store and transport information in a far more structured manner. Information stored in XML is self-describing - i.e. it can carry with it a much richer description of what each element of that information actually means.
XML (unlike HTML) is also extensible - which means that you can create your own tags to meet the requirements of your own information.
So, for example, if you wanted to publish some simple information about a set of books, in old-fashioned HTML you might author it as:
<h1>My holiday reading list</h1>
<p>
<b>Title:</b> The Third Man
<br>
<b>Author:</b> Graham Greene
<br>
<b>Year:</b> 1950
</p>
<p>
<b>Title:</b> Animal Farm
<br>
<b>Author:</b> George Orwell
<br>
<b>Year:</b> 1945
</p>
and so on...
From a computer's point of view, your HTML document simply contains a Level-1 heading and a number of paragraphs, each of which contains some formatted text and some line-breaks. There is no way that the computer can process this HTML information intelligently, to apply more sophisticated rules such as "sort the books by year of publication" or "show only the books by a particular author".
However, in XML, this kind of intelligent processing is possible. You might choose to store your book list information as follows:
<?xml version="1.0" encoding="ISO-8859-1"?>
<bibliography>
  <bibliography_title>
    My holiday reading list
  </bibliography_title>
  <book>
    <title>The Third Man</title>
    <author>
      <surname>Greene</surname>
      <forename>Graham</forename>
    </author>
    <year>1950</year>
  </book>
  <book>
    <title>Animal Farm</title>
    <author>
      <surname>Orwell</surname>
      <forename>George</forename>
    </author>
    <year>1945</year>
  </book>
</bibliography>
Example 27.1 - An XML representation of a simple reading list (the line indentations are not strictly required, but are provided here to emphasise XML's nested structure)
A computer program that can read and interpret XML (which is called an XML parser) will now be able to extract a great deal of meaning from the above information. It "knows" that you have chosen to define a category of information called a bibliography, inside which you have created sub-elements called books. Inside each book element, you then have title, author and year elements, and so on.
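As a rough illustration of what a parser makes possible, the sketch below reads the reading list with the XML parser in Python's standard library (xml.etree.ElementTree). Python is used here purely for illustration; the function name read_books and the dictionary keys are choices made for this example, and the XML declaration is left out of the string for brevity.

```python
import xml.etree.ElementTree as ET

# The reading list from Example 27.1, held in a string for convenience.
BOOKLIST_XML = """
<bibliography>
  <bibliography_title>My holiday reading list</bibliography_title>
  <book>
    <title>The Third Man</title>
    <author><surname>Greene</surname><forename>Graham</forename></author>
    <year>1950</year>
  </book>
  <book>
    <title>Animal Farm</title>
    <author><surname>Orwell</surname><forename>George</forename></author>
    <year>1945</year>
  </book>
</bibliography>
""".strip()


def read_books(xml_text):
    """Return one dict per <book> element (illustrative helper)."""
    root = ET.fromstring(xml_text)  # root is the <bibliography> element
    books = []
    for book in root.findall("book"):
        books.append({
            "title": book.findtext("title"),
            "surname": book.findtext("author/surname"),
            "forename": book.findtext("author/forename"),
            "year": int(book.findtext("year")),
        })
    return books


books = read_books(BOOKLIST_XML)
for b in books:
    print(f'{b["title"]} by {b["forename"]} {b["surname"]} ({b["year"]})')
```

Because every value is labelled by its element name, the program can ask for "the surname of the author of each book" directly, rather than guessing from bold text and line-breaks.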
Internet Explorer 6 contains a basic XML parser: if you save the above XML information into a file called booklist.xml and then open this file with IE6, you find that you can expand or contract each of the elements within the XML structure, by clicking on the small plus (+) or minus (-) signs. The Mozilla browser can do likewise.

Figure 27.1 - The above XML file viewed in Internet Explorer 6. Note that the second book element is collapsed in this view, and could be expanded by clicking on the small + sign
This hierarchical tree-like structure is a fundamental characteristic of XML, and the points at which such expansion or contraction is possible are called nodes. If you try to depart from the logic of this nested structure, for example:
<surname>
Greene
<forename>Graham
</surname>
</forename>
then you will find that IE6 refuses to display the file. XML is required to conform to a logical nested structure, in order to be said to be well-formed. Parsers will usually generate an error when asked to deal with XML files that are not well-formed.
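The same behaviour can be observed outside a browser. The following minimal sketch, assuming Python's standard xml.etree.ElementTree parser, feeds it one correctly nested fragment and one badly nested fragment; the helper name is_well_formed is made up for this example.

```python
import xml.etree.ElementTree as ET

WELL_FORMED = (
    "<author><surname>Greene</surname><forename>Graham</forename></author>"
)
# Same content, but <forename> is opened inside <surname> and closed outside it.
BADLY_NESTED = (
    "<author><surname>Greene<forename>Graham</surname></forename></author>"
)


def is_well_formed(xml_text):
    """Return True if the parser accepts the text, False if it rejects it."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


print(is_well_formed(WELL_FORMED))
print(is_well_formed(BADLY_NESTED))
```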
XHTML is simply a re-working of the old HTML language, so that it is now compliant with XML syntax. This enforces rather stricter rules on web page construction: all tags must be in lower-case, properly nested, and properly closed, if they are to be valid XHTML.
XHTML straddles the gap between old HTML and new XML: pages written in XHTML can be read by XML-enabled devices AND can also be displayed in the same way as conventional web pages in today's generation of web browsers.
But - XHTML still suffers from the same deficiency as HTML above: it is limited to a standard set of tags: you cannot define your own information structures.
The above discussion of XML structure and syntax will probably still seem rather academic. So what practical use can we make of XML's inherent descriptive power?
One of the first things that you might want to do with a file of XML data is to display it attractively in a browser. This is easily accomplished, using XML's own equivalent of stylesheets, a technology called XSLT (Extensible Stylesheet Language Transformations).
XSLT allows you to process the information contained within an XML file, using formatting rules that are contained within a linked style sheet (XSL) file. By doing this, you transform the original XML file into a format that is more suited to your needs - e.g. an XHTML web page, or perhaps a WML page for display on a WAP mobile 'phone, or even perhaps just a page of plain text.
For example, we can transform our XML book list above (Example 27.1) back into a conventional web page, by means of an XSL style sheet file. To do this, we first insert a single line of code near to the top of our booklist.xml file, to specify a link to the new XSL file that we will be creating:
<?xml version="1.0" encoding="ISO-8859-1"?>
<?xml-stylesheet type="text/xsl" href="transform.xsl"?>
<bibliography>
etc...
In the same directory, we then place an XSL stylesheet, with the filename transform.xsl. This XSL file contains a mixture of formatting information and references back to the data within the original XML file:
<?xml version="1.0" encoding="ISO-8859-1"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/">
    <html><body>
      <h1>My book list</h1>
      <ul>
        <xsl:for-each select="bibliography/book">
          <xsl:sort select="year"/>
          <li>
            <xsl:value-of select="title"/> - <xsl:value-of select="author/surname"/><br />
            <xsl:value-of select="year"/>
          </li>
        </xsl:for-each>
      </ul>
    </body></html>
  </xsl:template>
</xsl:stylesheet>
Example 27.2 - An XSL style sheet to transform our reading list (booklist.xml) from XML into a web page
Now when we open our booklist.xml file in IE6, the display is quite different:

Figure 27.2 - The same booklist.xml file, viewed in IE6, after it has been transformed by the linked XSL style sheet file
The linked XSL style sheet has transformed the XML data into a conventional XHTML web page. A subset of the book data from the original XML file is now displayed in a simple bullet list format.
Note that the XSL file refers to elements of data by their position within the "tree" structure of the original XML file - i.e. their node position.
So, for example, the author's surname is at the node position:
/bibliography/book/author/surname
The <xsl:for-each select="bibliography/book"> statement in the style sheet loops through each /bibliography/book/ node in the XML file, one by one. The <xsl:sort select="year"/> statement then sorts these items by the book's year of publication, so the resultant web page displays the earlier book (Animal Farm) first, despite the fact that it was listed second in the original XML file.
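The loop-and-sort logic can also be followed step by step in an ordinary programming language. The sketch below is not XSLT: Python's standard library has no XSLT processor, so it imitates the for-each, sort and value-of steps by hand with xml.etree.ElementTree. The function name render_book_list is hypothetical.

```python
import xml.etree.ElementTree as ET
from html import escape

BOOKLIST_XML = """
<bibliography>
  <book>
    <title>The Third Man</title>
    <author><surname>Greene</surname><forename>Graham</forename></author>
    <year>1950</year>
  </book>
  <book>
    <title>Animal Farm</title>
    <author><surname>Orwell</surname><forename>George</forename></author>
    <year>1945</year>
  </book>
</bibliography>
""".strip()


def render_book_list(xml_text):
    """Imitate the for-each / sort / value-of steps of the style sheet."""
    root = ET.fromstring(xml_text)  # the <bibliography> element
    # for-each over bibliography/book, sorted by year (compared as numbers here)
    books = sorted(root.findall("book"), key=lambda b: int(b.findtext("year")))
    items = []
    for book in books:
        # value-of: pull text out of the node at each relative path
        title = escape(book.findtext("title"))
        surname = escape(book.findtext("author/surname"))
        year = escape(book.findtext("year"))
        items.append(f"<li>{title} - {surname}<br />{year}</li>")
    return (
        "<html><body><h1>My book list</h1><ul>"
        + "".join(items)
        + "</ul></body></html>"
    )


page = render_book_list(BOOKLIST_XML)
print(page)
```

As in the XSLT version, Animal Farm (1945) comes out ahead of The Third Man (1950), even though it is listed second in the source data.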
The real power of XML becomes apparent when we start to apply more sophisticated logic to our XSL processing rules:
(omitting the surrounding header and footer code)
<h1>My book list (1950 or later)</h1>
<ul>
  <xsl:for-each select="bibliography/book">
    <xsl:sort select="year"/>
    <xsl:if test="year > 1949">
      <li>
        <xsl:value-of select="title"/> - <xsl:value-of select="author/surname"/><br />
        <xsl:value-of select="year"/>
      </li>
    </xsl:if>
  </xsl:for-each>
</ul>
Example 27.3 - A revision to our transform.xsl style sheet file, to apply more data processing logic
The <xsl:if test="year > 1949"> statement here restricts the visual output to only books for which the year value is greater than 1949. So now when the booklist.xml file is viewed in IE6, it only lists those books published in 1950 or later:

Figure 27.3 - The result of a more sophisticated XSL transformation
Likewise we could choose to display only those books whose author's surname contains the text "Orwell":
<xsl:for-each select="bibliography/book">
<xsl:if test="contains(author/surname, 'Orwell') ">
...data display in here as before...
</xsl:if>
</xsl:for-each>
and so on.
By applying different XSLT transformations to a single XML file, you can therefore generate many different "views" of the same underlying set of information, quickly and flexibly.
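The idea of several "views" over one data set can be sketched as a single function that takes the selection rule as a parameter. This is again a Python imitation of the XSLT tests above rather than XSLT itself; the function name view and the two rules are illustrative only.

```python
import xml.etree.ElementTree as ET

BOOKLIST_XML = """
<bibliography>
  <book>
    <title>The Third Man</title>
    <author><surname>Greene</surname><forename>Graham</forename></author>
    <year>1950</year>
  </book>
  <book>
    <title>Animal Farm</title>
    <author><surname>Orwell</surname><forename>George</forename></author>
    <year>1945</year>
  </book>
</bibliography>
""".strip()


def view(xml_text, keep):
    """Return the titles of the books for which keep(book) is true."""
    root = ET.fromstring(xml_text)
    return [b.findtext("title") for b in root.findall("book") if keep(b)]


# Counterpart of <xsl:if test="year > 1949">
post_1949 = view(BOOKLIST_XML, lambda b: int(b.findtext("year")) > 1949)

# Counterpart of <xsl:if test="contains(author/surname, 'Orwell')">
by_orwell = view(BOOKLIST_XML, lambda b: "Orwell" in b.findtext("author/surname"))

print(post_1949)
print(by_orwell)
```

The underlying XML is never modified; each view is simply a different rule applied to the same source.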
For example, if you were running a book shop, a suitable XML representation of your stock list could be transformed instantaneously to display "just the post-war fiction" or "everything under £12.99", etc. Likewise you could readily switch between a tabular display, a bullet list format, or perhaps a large-font version for people with visual difficulties.
A single set of core XML information can therefore be adapted for many different uses, across multiple display media, without manually sorting and re-formatting it every time. This is the principal advantage of XML over older HTML-based information.
The above examples make use of IE6's ability to parse XML documents and apply XSLT transformations. They are client side applications of XML technology: if you wanted to use XML and XSLT as above, you would need to be assured that everybody viewing your data was using an XML/XSLT-capable client browser. Older browsers such as Netscape 4 would not be able to display your XML-based information.
You may prefer to apply server side technologies to processing and transforming XML data. The PHP server-side scripting language, for example, is very often installed with an additional toolkit called Sablotron, which enables it to apply XSLT transformations easily in your PHP scripts:
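<?php
// Full paths to the XML data and the XSL style sheet on the server
$webpath=" [insert full UNIX path to your web directory here] ";
$xmlfile=$webpath."/booklist.xml";
$xslfile=$webpath."/transform.xsl";
// Create an XSLT processor and run the transformation;
// with NULL as the result argument, the output is returned as a string
$xh = xslt_create();
$result = xslt_process($xh,$xmlfile,$xslfile,NULL);
if ($result) {
print $result;
}
else {
print "Error: ".xslt_error($xh);
}
// Release the XSLT processor
xslt_free($xh);
?>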
Example 27.4 - A PHP script to transform the XML file booklist.xml with the XSL style sheet transform.xsl on the web server
The above PHP script will transform our original booklist.xml file with the style sheet transform.xsl, thereby generating an XHTML output page as before. Since this transformation is now done on the web server, the resultant XHTML page returned by the above script should be viewable by any browser.
To use the above code, adapt it to include the full UNIX path of the web directory you are using, save it in a file called booklist.php, upload it into your web directory along with the XML and XSL files, and make them available on the web. Then view the "live" version of the booklist.php web page in your browser.
We have already explained that XML documents must adhere to certain logical nesting and formatting rules before they can be said to be well-formed, and before a parser will attempt to process them. However, it is still possible to create XML documents that are well-formed but make absolutely no logical sense:
<book>
  <title>The Third Man</title>
  <author>
    <surname>Greene</surname>
    <shopping>Cheddar cheese</shopping>
  </author>
</book>
There is no logical reason why a <shopping> element should suddenly appear within the author details of a book - it is not at all relevant to the data being described. If we were running an XML-based book listing service, we would wish to make sure that our XML data contained only elements that we had decided upon, structured in a consistent way according to our own rules.
This higher-level form of checking can be achieved by validating your XML against either a DTD (document type definition) or a schema. These are two alternative methods of declaring a formal set of rules for the tags and information structures to be used within an XML document. In our book listing example above, a DTD or schema could specify the rule that an <author> element was only allowed to contain a single <surname> and a single <forename> element.
If an XML document complies with the rules specified in its associated DTD or schema, it is said to be valid. Some types of XML parsers can validate XML documents in this way (for examples see: www.stg.brown.edu/service/xmlvalid/ and www.cogsci.ed.ac.uk/~richard/xml-check.html) whilst others are non-validating parsers (see wdvl.internet.com/Software/XML/parsers.html)
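To make the difference between well-formed and valid more concrete, the sketch below applies the single rule mentioned above (an author element must contain exactly one surname and exactly one forename) to the "shopping" example. This is a hand-written check in Python, not a validating parser, and it does not read a DTD or schema; the names ALLOWED_IN_AUTHOR and author_errors are made up for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical rule set: each of these must appear exactly once in <author>.
ALLOWED_IN_AUTHOR = ["surname", "forename"]

SHOPPING_EXAMPLE = """
<book>
  <title>The Third Man</title>
  <author>
    <surname>Greene</surname>
    <shopping>Cheddar cheese</shopping>
  </author>
</book>
""".strip()


def author_errors(xml_text):
    """Return a list of rule violations found in <author> elements."""
    root = ET.fromstring(xml_text)  # raises ParseError if not well-formed
    errors = []
    for author in root.iter("author"):
        child_tags = [child.tag for child in author]
        for tag in child_tags:
            if tag not in ALLOWED_IN_AUTHOR:
                errors.append(f"unexpected <{tag}> inside <author>")
        for tag in ALLOWED_IN_AUTHOR:
            count = child_tags.count(tag)
            if count != 1:
                errors.append(
                    f"expected exactly one <{tag}> inside <author>, found {count}"
                )
    return errors


for problem in author_errors(SHOPPING_EXAMPLE):
    print(problem)
```

The document parses without complaint because it is well-formed, yet the rule check still reports two problems. A real validating parser would derive such rules from the DTD or schema instead of having them written into the program.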
DTDs are slightly less complex than schema to construct and understand: an XML document may reference an external DTD file (bibliography.dtd) to which it conforms, using a DOCTYPE declaration:
<!DOCTYPE bibliography SYSTEM "bibliography.dtd">
The DTD file itself is NOT written in XML, but simply contains a listing of element definitions as follows:
<!ELEMENT bibliography (bibliography_title, book+) >
<!ELEMENT bibliography_title (#PCDATA) >
<!ELEMENT book (title, author+, year) >
etc.
In contrast, schema are themselves written in XML. They are more flexible than DTDs, because they can specify more sophisticated data type rules, e.g. "a book's ISBN number must contain 10 digits exactly". They also require the use of namespaces - a method of formally identifying "vocabularies" of XML tags to be used in different contexts. They are a more modern development than DTDs, but tend to be longer and more complex to understand.
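The role of namespaces can be seen in a small sketch. Here two vocabularies both define a tag called title, one for books and one for people, and the namespace tells a program which is which. The example uses Python's standard xml.etree.ElementTree; both namespace URIs and the order element are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Both namespace URIs below are made up for this example.
NS = {
    "bk": "http://example.org/ns/books",
    "pp": "http://example.org/ns/people",
}

DOC = """
<order xmlns:bk="http://example.org/ns/books"
       xmlns:pp="http://example.org/ns/people">
  <bk:title>Animal Farm</bk:title>
  <pp:title>Dr</pp:title>
</order>
""".strip()

root = ET.fromstring(DOC)

# The prefix is only shorthand; the parser stores the full URI with each tag.
book_title = root.findtext("bk:title", namespaces=NS)
person_title = root.findtext("pp:title", namespaces=NS)

print(book_title)
print(person_title)
print(root[0].tag)
```

The two title elements never collide, because each is identified by its namespace URI plus its local name rather than by the local name alone.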
The full technical details of DTD and schema construction are considerably beyond the scope of this Factsheet. The use of specialised software such as XMLSpy is advisable, rather than attempting to type and de-bug large files full of complex code. Specialised literature such as the O'Reilly books is also recommended.
Because of its rich data structure and ease of transport across the Internet, XML has been adopted as the core technology for many large and complex computing problems:
For several years now, the British government has sought to "join up" the operations of many public sector departments (e.g. health care, social care, housing, police) so that they can offer a more efficient and integrated approach to providing services. In order to do this, it is essential that these organisations should be able to exchange information about people and services between themselves in a secure, structured and consistent format.
In the past, efforts to integrate services in this way have been hampered by the fact that the various organisations typically used a range of incompatible proprietary database systems, that could not exchange data with one another effectively.
However, XML offers the opportunity to overcome this problem: the e-GIF (e-Government Interoperability Framework) now seeks to encourage data transfer between public sector organisations, by developing an agreed set of XML data structures (technically known as XML schema) as a standard format for data exchange.
So regardless of the specific proprietary database systems in use by the various organisations, they should still be able to "talk to one another" via their ability to export and import standardised XML data. Such systems are therefore referred to as interoperable.
The e-GIF schema are published at www.govtalk.gov.uk
Internet-based teaching and learning has become an extremely valuable and important activity, both within traditional universities and colleges and more widely via initiatives such as LearnDirect. Many commercial companies have now developed "virtual learning environments" - web-based software systems that allow tutors to publish their learning materials, and students to log in and progress through them, in a structured manner.
However, most of these competing software systems were originally incompatible with one another. The considerable investment of time and effort involved in building a course structure for one learning environment needed to be repeated if the same materials were also to be delivered by a different environment.
To overcome this limitation, a range of educational and government agencies, together with some commercial software developers, have been working together to develop a set of XML-based technical specifications that can be adopted by the developers of e-learning software, so that their products are interoperable. Learning materials and complex course structures can then be exported from one and imported directly into another.
The principal outcomes of these efforts are:
The IMS specifications at www.imsglobal.org : these include standard XML document structures (DTDs) for describing the "packaging" of learning materials into exportable units, for defining the structure of multiple choice question sets, etc.
SCORM (the Sharable Content Object Reference Model), a closely related initiative specifying the methods by which learning environments should implement interoperability:
www.rhassociates.com/scorm.htm
www.adlnet.org/index.cfm?fuseaction=scormabt
www.altrc.org/specification.asp
SMIL stands for Synchronized Multimedia Integration Language and is an XML-based method for delivering multimedia content via the web. It has been adopted as a standard by the W3C. Its main strength is the ability to synchronise the playing of many separate objects, for example:
<smil>
<!-- head region omitted for brevity -->
<body>
<par dur="60s">
<audio src="soundtrack.mp3" begin="0s" />
<textstream src="scrollcaption.txt" begin="10s" />
<seq>
<img src="image1.gif" begin="10s" dur="20s" />
<img src="image2.gif" dur="15s" />
<img src="image3.gif" dur="15s" />
</seq>
</par>
</body>
</smil>
The above SMIL code would play an audio file (soundtrack.mp3), display a scrolling text caption (scrollcaption.txt) after 10 seconds, and show a sequence of three GIF images one after another, all within a scheduled 60 second presentation.
(Note that SMIL files typically also contain a <head> region, which controls the visual layout of the presentation - see www.w3.org/AudioVideo/ and www.bu.edu/webcentral/learning/smil1 for full SMIL authoring details).
Both Apple's QuickTime player (www.apple.com/quicktime) and RealNetworks' RealPlayer (www.real.com) can play SMIL presentations. Microsoft has developed a similar language, HTML+TIME, which plays in IE6 although its syntax is somewhat different from SMIL: msdn.microsoft.com/library/default.asp?url=/workshop/author/behaviors/time.asp
The incorporation of mathematical expressions into web pages has long been a problem. Standard HTML makes no provision for integral signs, matrices, etc., and has only limited support for superscripts and subscripts. Many mathematical web authors have been forced to resort to creating GIF images of their mathematical equations, and inserting these into their web pages with the <img> tag.
For this reason, the W3C has supported the development of Mathematical Markup Language (MathML) - an XML-based method of describing and publishing mathematical expressions in web pages. Examples of MathML usage are given at www.w3.org/TR/REC-MathML/chapter2.html and studentwebs.colstate.edu/cato_john/tootomatic/MathML/MathML-7-1.html
At the time of writing, the Mozilla browser has some native support for displaying pages which mix MathML expressions with more conventional XHTML content (Figure 27.4). Most other browsers still require a special plug-in.

Figure 27.4 - A MathML expression embedded within an XHTML page, viewed in Mozilla v.1.3. The underlying code (left) is somewhat complex.
Fortunately, tools are now being developed to simplify the process of MathML authoring and viewing. EzMath from www.w3.org/People/Raggett/EzMath/ functions as both a browser plug-in and a MathML authoring tool.
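Because MathML is ordinary XML, simple expressions can also be generated by a program. The sketch below builds the presentation markup for a base raised to a power (for example x squared) using Python's standard xml.etree.ElementTree. The helper name power is hypothetical; msup, mi and mn are MathML presentation elements for a superscript, an identifier and a number.

```python
import xml.etree.ElementTree as ET

MATHML_NS = "http://www.w3.org/1998/Math/MathML"  # the MathML namespace


def power(base, exponent):
    """Return MathML markup for base raised to exponent (illustrative helper)."""
    # Serialise the MathML namespace as the default namespace (no prefix).
    ET.register_namespace("", MATHML_NS)
    math = ET.Element(f"{{{MATHML_NS}}}math")
    msup = ET.SubElement(math, f"{{{MATHML_NS}}}msup")
    ET.SubElement(msup, f"{{{MATHML_NS}}}mi").text = base           # identifier
    ET.SubElement(msup, f"{{{MATHML_NS}}}mn").text = str(exponent)  # number
    return ET.tostring(math, encoding="unicode")


markup = power("x", 2)
print(markup)
```

Even this tiny expression needs four nested elements, which gives some idea of why hand-authoring larger equations quickly becomes laborious.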
Metadata is commonly used on the web to provide descriptive information about web pages (e.g. their author, date, keywords etc.). Currently, this is most often done in a simplistic way (using <meta> tags in XHTML) to optimise the chances that pages will be indexed by the large search engines.
Resource Description Framework (RDF) is a newer and much more powerful application of XML to the problem of organising metadata. By expressing metadata in XML, it aims to improve the scope for automated processing and interchange of web resources, and thereby to enable the development of more "intelligent" search and retrieval tools for web information. The popular and widely-used Dublin Core framework for specifying metadata has now been implemented in RDF: www.dublincore.org/documents/2002/07/31/dcmes-xml/
RDF provides the underpinning technology for an initiative called the semantic web (www.w3.org/2001/sw). This aims to promote the development of a more integrated web, in which information resources are self-describing and can be automatically processed and retrieved more accurately.
Excellent introductions to this somewhat abstract topic are available at www.xml.com/pub/a/2001/01/24/rdf.html?page=1 and www.w3.org/RDF/FAQ
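As a small illustration of machine-readable metadata, the sketch below reads a Dublin Core description expressed in RDF/XML, using Python's standard xml.etree.ElementTree. The two namespace URIs are the standard RDF and Dublin Core element namespaces; the resource address, the metadata values and the function name describe are made up for this example.

```python
import xml.etree.ElementTree as ET

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Invented metadata record describing a hypothetical web resource.
RDF_XML = """
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://www.example.org/booklist.xml">
    <dc:title>My holiday reading list</dc:title>
    <dc:creator>A. N. Author</dc:creator>
    <dc:date>2003-06-01</dc:date>
  </rdf:Description>
</rdf:RDF>
""".strip()


def describe(rdf_text):
    """Map each described resource to a dict of its Dublin Core properties."""
    root = ET.fromstring(rdf_text)
    dc_prefix = "{" + NS["dc"] + "}"
    records = {}
    for desc in root.findall("rdf:Description", NS):
        about = desc.get("{" + NS["rdf"] + "}about")
        records[about] = {
            child.tag[len(dc_prefix):]: child.text
            for child in desc
            if child.tag.startswith(dc_prefix)
        }
    return records


metadata = describe(RDF_XML)
print(metadata)
```

Unlike free-text <meta> keywords, each property here belongs to a published vocabulary, so a program can tell a creator from a title without guesswork.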
Many news-based web sites now offer an exportable "news feed" in XML format. This usually consists of a simple XML file that contains their latest news headlines, plus links back to the relevant story on their site. Any other site may then freely import this news feed, and process it for display on their own web site. This swapping of information content on the web is known as syndication, and the XML format used is usually a version of RSS (originally RDF Site Summary; other versions of the format expand the name as Rich Site Summary or Really Simple Syndication).

Figure 27.5 - The DevShed web site (a magazine and news site for web developers) offers its latest headlines in RSS format (left). These can be imported, parsed with a simple PHP or Perl script, and displayed on your own web site (right).
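The import-and-display step can be sketched in a few lines. The example below uses Python and its standard xml.etree.ElementTree parser rather than PHP or Perl, and it works on a made-up feed held in a string instead of downloading a real one. The feed contents, the example.org addresses and the function names are all invented, and the feed follows the simple rss/channel/item layout rather than the RDF-based variant.

```python
import xml.etree.ElementTree as ET
from html import escape

# An invented feed in the simple rss/channel/item layout.
FEED = """
<rss version="0.91">
  <channel>
    <title>Example News</title>
    <link>http://www.example.org/</link>
    <description>Made-up headlines for illustration</description>
    <item>
      <title>First example headline</title>
      <link>http://www.example.org/story1.html</link>
    </item>
    <item>
      <title>Second example headline</title>
      <link>http://www.example.org/story2.html</link>
    </item>
  </channel>
</rss>
""".strip()


def headlines(feed_text):
    """Return (headline, link) pairs for every item in the feed."""
    root = ET.fromstring(feed_text)
    return [
        (item.findtext("title"), item.findtext("link"))
        for item in root.findall("channel/item")
    ]


def as_html_list(items):
    """Format the headlines as a bullet list of links for a web page."""
    rows = [
        f'<li><a href="{escape(link)}">{escape(title)}</a></li>'
        for title, link in items
    ]
    return "<ul>" + "".join(rows) + "</ul>"


stories = headlines(FEED)
print(as_html_list(stories))
```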
One of the best sources of RSS format news feeds is currently www.moreover.com which supplies news headlines and sport. An excellent guide to RSS is available at www.mnot.net/rss/tutorial/
As the web becomes capable of delivering ever more sophisticated services, it is desirable that these applications should be able to exchange data with one another. SOAP (Simple Object Access Protocol) is an XML-based protocol designed to enable this process. Using SOAP, the web can be used to exchange complex data, undertake automated transactions and invoke services remotely from third party application providers.
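To give a feel for what travels over the wire, the sketch below builds a minimal SOAP request message and reads it back, using Python's standard xml.etree.ElementTree. The envelope namespace is the SOAP 1.1 one; the service namespace urn:example-search, the search operation and its query parameter are hypothetical and do not correspond to any real provider's interface. No network call is made.

```python
import xml.etree.ElementTree as ET

SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope
SERVICE_NS = "urn:example-search"  # hypothetical service namespace


def build_request(query):
    """Wrap a hypothetical 'search' call in a SOAP envelope."""
    ET.register_namespace("soap", SOAP_ENV)
    envelope = ET.Element(f"{{{SOAP_ENV}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_ENV}}}Body")
    call = ET.SubElement(body, f"{{{SERVICE_NS}}}search")
    ET.SubElement(call, "query").text = query
    return ET.tostring(envelope, encoding="unicode")


def read_request(message):
    """Extract the query text, as a receiving service might."""
    root = ET.fromstring(message)
    call = root.find(f"{{{SOAP_ENV}}}Body/{{{SERVICE_NS}}}search")
    return call.findtext("query")


message = build_request("XML tutorials")
print(message)
print(read_request(message))
```

In practice the message would be sent to the provider over HTTP, and the reply would come back as another envelope of the same shape containing the results.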
One simple example of a SOAP-based web service is provided by the Google search engine. It is possible to send a search query to Google's databases via the SOAP protocol, and Google will send the search results back in an XML format SOAP response. This means that you can build web applications that query the Google search index automatically, process the results and display them within your own site in any manner you choose. Techniques for doing this are described at www.devshed.com/Server_Side/PHP/GoogleAPI/page1.html and www.google.com/apis/ .
More complex SOAP-based web services are provided by large e-commerce enterprises such as Amazon.com and eBay. For example, you can use Amazon's web services to place a "storefront" allowing users to search and buy Amazon products from your own web site - see www.amazon.com/gp/aws/landing.html and www.cybaea.net/Publications/Business Platforms.html.
SOAP is also an important component of Microsoft's development plans for the .NET platform (www.microsoft.com/net/), which aims to integrate local desktop applications more closely with remote web services.
More information about SOAP is available from www.w3schools.com/soap/default.asp and www.w3.org/TR/SOAP/
XML is such an immense subject that no single resource will cover all of its aspects. Nevertheless, the following should prove useful:
Devshed XML section (particularly useful information about using XML in server-side web applications)
www.devshed.com/Server_Side/XML
www.w3schools.com - follow the links to various XML tutorials.
www.xml.com - contains useful "FAQ" guides to all aspects of XML
"Learning XML" by Erik T. Ray, published in 2001 by O'Reilly, ISBN 0-596-00046-4 - a good all-round introduction.
"How to use XML" by John Shelley, published in 2002 by Babani computer books, ISBN 0-85934-532-7 - contains in-depth details of DTDs, schema and namespaces, but nothing on XSLT.
All the XML books in print: www.xmlbooks.com