ETD Guide/Technical Issues/SGML\XML Overview

SGML/XML is a strategy that serves several purposes. First, it supports long-term preservation: "It allows librarians to ensure longevity of digital dissertations. Modern hardware and redundancy can keep all the bits of an electronic thesis or dissertation (ETD) intact. But electronic archives must be modernized continually as new document formats become popular." Because librarians tend to think in decades, document formats such as TIFF, PostScript or PDF do not meet their requirements. If PDF were replaced by another de facto standard (one set by industry rather than by a body such as ISO), preserving digital documents would mean converting thousands of documents. XML can help overcome those difficulties. "If an electronic document is to be of 'archival quality', it should be liberated from the page metaphor."

A second reason for using SGML/XML is that it ensures reusability of documents by preserving raw data and content-based structuring of information pieces. Preserving data for statistics and formulas in mathematics and chemistry could allow researchers to reuse and repeat simulations, calculations and experiments, deriving the needed data directly from an archive.

Third, using structured information allows the same information or documents to be reused in different contexts. For example, the same digital dissertation can be used to produce an online or a print version, and to produce additional information products, such as monthly proceedings containing the abstracts of all dissertations produced within the university during the last month, or a citation index. Additionally, the dissertation can be rendered for different media, so a Braille reader or an automatic voice synthesizer could be used as a back-end machine.
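
A minimal sketch of this single-source idea, written in Python with only the standard library. The element names (thesis, title, author, abstract) and the sample record are assumptions made for illustration; they are not taken from any actual ETD DTD.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical, heavily simplified thesis record; real ETD markup is far richer.
THESIS_XML = """
<thesis>
  <title>Markup for Long-Term Archiving</title>
  <author>A. Example</author>
  <abstract>Structured markup keeps content independent of layout.</abstract>
</thesis>
"""

def abstract_entry(thesis):
    """Plain-text entry for a listing of abstracts."""
    return "{} ({}): {}".format(
        thesis.findtext("title"),
        thesis.findtext("author"),
        thesis.findtext("abstract"),
    )

def html_page(thesis):
    """Very small HTML rendering of the same record for online display."""
    return "<h1>{}</h1><p>{}</p>".format(
        escape(thesis.findtext("title")),
        escape(thesis.findtext("abstract")),
    )

thesis = ET.fromstring(THESIS_XML)
print(abstract_entry(thesis))
print(html_page(thesis))
```

Both outputs are derived from the same source record, so a correction made once in the XML would appear in every product generated from it.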

Another reason for encoding documents with markup is that an archive can offer its users broader and more precise retrieval. As university libraries are increasingly challenged by the problem of handling, converting, archiving and providing electronic publications, one of their major tasks is to provide a new quality of retrieval within the user interface. An SGML/XML-based publishing concept enables a new quality in the distribution of scientific content through targeted information and knowledge management.

What does SGML/XML mean?

The Extensible Markup Language (XML) is the universal format for structured documents and data on the Web. The current W3C Recommendations are XML 1.0 (February 1998), Namespaces (January 1999), Associating Stylesheets (June 1999) and XSLT/XPath (November 1999) (http://www.w3.org/XML). The development of XML started in 1996, and it has been a W3C (http://www.w3.org/) standard since February 1998, which may make you suspect that this is a rather immature technology. In fact, the technology is not very new.

Before XML there was the Standard Generalized Markup Language (SGML), developed in the early 1980s, an ISO standard since 1986, and widely used for large documentation projects. There was also, of course, HTML, whose development started in 1990. The designers of XML simply took the best parts of SGML, guided by the experience with HTML, and produced something that is no less powerful than SGML, but vastly more regular and simpler to use. While SGML was mostly used for technical documentation and much less for other kinds of data, with XML it is the opposite.

"Structured data", such as mathematical or chemical formulas, spreadsheets, address books, configuration parameters, financial transactions, technical drawings, etc. are usually put on the Web using the output of layout programs as Postscript or PDF or by putting them into graphic formats like gif, jpeg, png, vrml, and so on. Programs that produce such data often also store it on disk, for which they can use either a binary format or a text format. So, if soemebody wants to look at the data, he usually needs the program that produced it. With XML those data could be stored in a text format, which allows the user reading the file without having the original program. XML is a set of rules, guidelines, conventions, whatever you want to call them, for designing text formats for such data, in a way that produces files that are easy to generate and read (by a computer).

The Extensible Markup Language (XML) is a markup or structuring language for documents, a so-called metalanguage, that defines rules for the structural markup of documents independently of any output medium. XML is a "reduced" version of the Standard Generalized Markup Language (SGML), which has been an ISO standard since 1986. In the field of internet publishing, SGML never achieved wide success because of the complexity of the standard and the high cost of the tools. It prevailed only in certain areas, such as technical documentation in large enterprises (for example Boeing) and patent information. The main philosophy of SGML and XML is the strict separation of content, structure and layout of documents. Most ETD projects use either the SGML standard (ISO 8879 with Corrigendum K of 4 December 1997) or the World Wide Web Consortium (W3C) definition of XML 1.0 (10 February 1998, revised 6 October 2000). The crux of all those projects has always been the document type definition (DTD).
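
To give an impression of what a DTD looks like, the Python sketch below embeds a miniature DTD in a document's internal subset and prints the resulting element structure. The DTD is a made-up toy, not one used by any ETD project. Note also that Python's standard-library parser only checks well-formedness; it reads the DTD but does not validate the document against it.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature DTD in the internal subset, followed by a conforming document.
DOCUMENT = """<!DOCTYPE thesis [
  <!ELEMENT thesis (front, chapter+)>
  <!ELEMENT front (title, author)>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT author (#PCDATA)>
  <!ELEMENT chapter (heading, para+)>
  <!ELEMENT heading (#PCDATA)>
  <!ELEMENT para (#PCDATA)>
]>
<thesis>
  <front><title>Sample Thesis</title><author>A. Example</author></front>
  <chapter><heading>Introduction</heading><para>First paragraph.</para></chapter>
</thesis>
"""

def outline(element, depth=0):
    """Return the element structure as indented tag names, ignoring text and layout."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines

# ElementTree checks well-formedness only; it does not validate against the DTD.
root = ET.fromstring(DOCUMENT)
print("\n".join(outline(root)))
```

The DTD describes only structure (which elements may appear and in what order); nothing in it says how a heading or paragraph should look. Enforcing the DTD would require a validating parser, which is outside the scope of this sketch.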

