ETD Guide/Technical Issues/SGML\XML and Other Markup Languages

SGML (Standard Generalized Markup Language) and XML (eXtensible Markup Language) are markup languages. They use tags, written as label names enclosed in "<" and ">", placed around the sections of a document that are thus marked or bracketed. A Document Type Definition (DTD) specifies the grammar or structure for a type or class of documents. SGML requires a DTD, while for XML a DTD is optional. Given current trends, XML seems the more likely of the two to be used, for the following reasons.

XML is a method for putting structured data in a text file.
For "structured data", think of such things as spreadsheets, address books, configuration parameters, financial transactions, technical drawings, etc. Programs that produce such data often also store it on disk, for which they can use either a binary format or a text format. The latter allows you, if necessary, to look at the data without the program that produced it. XML is a set of rules (guidelines or conventions, whatever you want to call them) for designing text formats for such data, in a way that produces files that are easy to generate and read (by a computer), that are unambiguous, and that avoid common pitfalls such as lack of extensibility, lack of support for internationalization/localization, and platform dependency.
XML looks a bit like HTML but isn't HTML
Like HTML, XML makes use of tags and attributes (of the form name="value"), but while HTML specifies what each tag and attribute means (and often how the text between them will look in a browser), XML uses the tags only to delimit pieces of data, and leaves the interpretation of the data completely to the application that reads it. In other words, if you see "<p>" in an XML file, don't assume it is a paragraph. Depending on the context, it may be a price, a parameter, or a person. In short, XML allows you to develop your own markup language specific to a particular domain.
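As a rough illustration of this point, the following Python sketch (using the standard-library xml.etree.ElementTree module) reads two small documents that both contain a <p> tag. The documents and the element names catalog, item, and staff are invented for this example; the only thing taken from the text above is that the meaning of <p> is decided by the reading application.

```python
import xml.etree.ElementTree as ET

# Two hypothetical documents that both use a <p> tag with different meanings.
catalog = ET.fromstring('<catalog><item name="pen"><p>1.50</p></item></catalog>')
staff = ET.fromstring('<staff><p>Ada Lovelace</p></staff>')

# The application, not XML, decides how to interpret <p>.
price = float(catalog.find("item/p").text)   # here <p> holds a price
person = staff.find("p").text                # here <p> holds a person's name

print(price, person)
```

Neither document would be a sensible HTML page, yet both are perfectly good XML.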
XML documents can be preserved for a long time.
XML is, at a basic level, an incredibly simple data format. It can be written in 100 percent pure ASCII text as well as in a few other well-defined formats. ASCII text is reasonably resistant to corruption. XML is also very well documented: the W3C’s XML 1.0 specification tells us exactly how to read XML data.
XML is license-free, platform-independent and well-supported.
By choosing XML as the basis for some project, you buy into a large and growing community of tools (one of which may already do what you need!) and engineers experienced in the technology. Opting for XML is a bit like choosing SQL for databases: you still have to build your own database and your own programs/procedures that manipulate it, but there are many tools available and many people that can help you. And since XML, as a W3C technology, is license-free, you can build your own software around it without paying anybody anything. The large and growing support means that you are also not tied to a single vendor. XML isn't always the best solution, but it is always worth considering.
XML is a family of technologies.
There is XML 1.0, the specification that defines what "tags" and "attributes" are, but around XML 1.0 there is a growing set of optional modules that provide sets of tags and attributes, or guidelines for specific tasks. There is, for example, XLink, which describes a standard way to add hyperlinks to an XML file. XPointer and XFragments are syntaxes for pointing to parts of an XML document. (An XPointer is a bit like a URL, but instead of pointing to documents on the Web, it points to pieces of data inside an XML file.) CSS, the style sheet language, is applicable to XML as it is to HTML. XSL is the advanced language for expressing style sheets. The DOM is a standard set of function calls for manipulating XML (and HTML) files from a programming language. XML Namespaces is a specification that describes how you can associate a URL with every single tag and attribute in an XML document. What that URL is used for is up to the application that reads it, though. XML Schemas help developers to precisely define their own XML-based formats. There are several more modules and tools available or under development.
XML provides structured and integrated data.
XML is ideal for large and complex data such as ETDs because the data is structured. It not only lets you specify a vocabulary that defines the elements in the document, but also allows you to specify the relations between the elements.
XML can encode metadata about ETDs.
Documents are often supplemented with metadata (that is, data about data). If such metadata were included inside an ETD, it would make the ETD self-describing. XML can encode such metadata.

On the downside, however, XML comes with its own bag of discomforts.

Conversion from word processing formats to XML requires more planning in advance, different tools, and broader learning about processing concepts than is required for PDF.
Far fewer people are knowledgeable about these matters, and the tools that support this conversion are less mature and more expensive. The conversion process may also be complicated, difficult, and time-consuming.
Writing directly in XML with XML authoring tools requires some prior knowledge of XML.
XML is also very strict about the naming and ordering of tags, and it is case sensitive. This illustrates the relative effort required of students to prepare ETDs in this form.

Process of Creating an XML document

XML documents have a four-stage life cycle.

XML documents are mostly created using an editor. This may be a basic text editor such as Notepad or vi, or even a WYSIWYG editor. The XML parser reads the document and converts it into a tree of elements. The parser passes the tree to the browser, which displays it. It is important to know that all these processes are independent and decoupled from each other.
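
The parsing step can be sketched in a few lines of Python with the standard-library xml.etree.ElementTree module. This is only a minimal illustration: the sample document and its element names (thesis, title, chapter, heading, para) are assumptions made for this example, and printing an indented outline stands in for the display stage.

```python
import xml.etree.ElementTree as ET

# A hypothetical, minimal thesis document (element names are assumptions).
source = """<thesis>
  <title>Markup for ETDs</title>
  <chapter><heading>Introduction</heading><para>Some text.</para></chapter>
</thesis>"""

# Parsing stage: the parser turns the text into a tree of elements.
root = ET.fromstring(source)

def outline(element, depth=0):
    """Return one indented line per element, visiting the tree depth first."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines

# Display stage (here just printing) works on the tree, not on the raw text.
tree_lines = outline(root)
print("\n".join(tree_lines))
```

Because the display code only sees the tree, the same document could be written in any editor and shown by any viewer, which is the decoupling described above.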

Putting XML to work for ETDs

Before we jump into the XML details for ETDs, we should make certain things clear, since we will be using them regularly from now on.

DTD (Document Type Definition):

An XML document primarily consists of a strictly nested hierarchy of elements with a single root. Elements can contain character data, child elements, or a mixture of both. There are different kinds of documents, such as a letter, a poem, a book, or a thesis, and each kind has its own structure. This specific structure is defined in a separate document called the Document Type Definition (DTD).
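
To make this concrete, here is a small sketch in Python. The DTD below is written as an internal subset and its element names (thesis, title, author, chapter) are invented for illustration; it is not the actual ETD DTD. Note also that Python's standard-library parser checks only well-formedness and does not validate against a DTD, so the check of the content model is written by hand here.

```python
import xml.etree.ElementTree as ET

# Hypothetical document with an internal DTD subset. The element names are
# invented for illustration and are not taken from the actual ETD DTD.
document = """<?xml version="1.0"?>
<!DOCTYPE thesis [
  <!ELEMENT thesis (title, author, chapter+)>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT author (#PCDATA)>
  <!ELEMENT chapter (#PCDATA)>
]>
<thesis>
  <title>Markup for ETDs</title>
  <author>A. Student</author>
  <chapter>Introduction</chapter>
  <chapter>Conclusions</chapter>
</thesis>"""

# ElementTree checks well-formedness only; it does not validate against the DTD.
root = ET.fromstring(document)

# Hand-written check mirroring the content model (title, author, chapter+).
child_tags = [child.tag for child in root]
follows_model = (
    len(child_tags) >= 3
    and child_tags[:2] == ["title", "author"]
    and all(tag == "chapter" for tag in child_tags[2:])
)
print(child_tags, follows_model)
```

A validating parser would perform this kind of check automatically for every element declared in the DTD.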

The DTD used is based on XML and covers most of the basic HTML formatting tags as well as some specific tags from the Dublin Core metadata. A DTD has been developed for ETDs, but it is too generic for some purposes. If someone wants to use mathematical or chemical equations, it won't be sufficient. For those we can incorporate MathML (Mathematical Markup Language) and/or CML (Chemical Markup Language). There are defined DTDs for these languages that we would also have to use for our documents. However, research on incorporating more than one DTD for different parts of a document is still going on.

CSS (Cascading Style Sheets):

CSS is a flexible, cross-platform, standards-based language used to suggest stylistic or presentational features applied throughout entire websites or web pages. In their most elegant form, style sheets are specified in a separate file and called from within the XML or HTML header area when the document loads into a CSS-enabled browser. Users can always turn off the author's styles and apply their own, or mix their own important styles with the author's. This points to the "cascading" aspect of CSS.

CSS is based on rules and style sheets. A rule is a statement about one stylistic aspect of one or more elements. A style sheet is one or more rules that apply to a markup document.

An example of a simple style sheet is a sheet that consists of one rule. In the following example, we add a color to all first-level headings (H1). Here's the line of code - the rule - that we add:

H1 {color: red}

XSL (the eXtensible Stylesheet Language):

XSL is a language for expressing stylesheets. It consists of two parts:

A language for transforming XML documents, and
An XML vocabulary for specifying formatting semantics.

If you don't understand the meaning of this, think of XSL as a language that can transform XML into HTML, a language that can filter and sort XML data, a language that can address parts of an XML document, a language that can format XML data based on the data value, like displaying negative numbers in red, and a language that can output XML data to different devices, like screen, paper or voice. XSL is developed by the W3C XSL Working Group whose charter is to develop the next version of XSL.

Because XML does not use predefined tags (we can use any tags we want), the meanings of these tags are not understood: <table> could mean an HTML table or maybe a piece of furniture. Because of the nature of XML, the browser does not know how to display an XML document.

In order to display XML documents, it is necessary to have a mechanism to describe how the document should be displayed. One of these mechanisms is CSS as discussed above, but XSL is the preferred style sheet language of XML, and XSL is far more sophisticated and powerful than the CSS used by HTML.
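
The kind of work an XSL transformation does can be imitated in a few lines of Python. Python's standard library has no XSLT processor, so the sketch below is not XSL; it only mimics, with xml.etree.ElementTree, three of the tasks mentioned above: turning XML into HTML, sorting the data, and formatting by value (negative numbers in red). The input document and its element names (accounts, entry, name, amount) are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical input data; element names are assumptions for this sketch.
source = ET.fromstring(
    "<accounts>"
    "<entry><name>rent</name><amount>-500</amount></entry>"
    "<entry><name>grant</name><amount>1200</amount></entry>"
    "</accounts>"
)

# Build an HTML list, sorted by name, with negative amounts marked in red.
html = ET.Element("ul")
for entry in sorted(source.findall("entry"), key=lambda e: e.findtext("name")):
    amount = float(entry.findtext("amount"))
    item = ET.SubElement(html, "li")
    if amount < 0:
        item.set("style", "color: red")
    item.text = f"{entry.findtext('name')}: {amount:.2f}"

result = ET.tostring(html, encoding="unicode")
print(result)
```

In a real XSL stylesheet the same rules would be stated declaratively as templates rather than as a loop, and an XSLT processor would apply them.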

XML Namespaces

The purpose of XML namespaces is to distinguish between duplicate element type and attribute names. Such duplication might occur, for example, in an XSLT stylesheet or in a document that contains element types and attributes from two different DTDs.

An XML namespace is a collection of element type and attribute names. The namespace is identified by a unique name, which is a URI. Thus, any element type or attribute name in an XML namespace can be uniquely identified by a two-part name: the name of its XML namespace and its local name. This two part naming system is the only function of XML namespaces.

XML namespaces are declared with an xmlns attribute, which can associate a prefix with the namespace. The declaration is in scope for the element containing the attribute and all its descendants. For example, the code below declares two XML namespaces. Their scope is the A and B elements:
<A xmlns:foo="http://www.foo.org/" xmlns="http://www.bar.org/"><B>abcd</B></A>
If an XML namespace declaration contains a prefix, you refer to element type and attribute names in that namespace with the prefix. For example, the code below declares A and B in the http://www.foo.org/ namespace, which is associated with the foo prefix:
<foo:A xmlns:foo="http://www.foo.org/"> <foo:B>abcd</foo:B> </foo:A>
If an XML namespace declaration does not contain a prefix, the namespace is the default XML namespace and you refer to element type names in that namespace without a prefix. For example, the code below is the same as the previous example but uses a default namespace instead of the foo prefix:
<A xmlns="http://www.foo.org/"><B>abcd</B></A>
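
One way to convince yourself that the prefixed and default forms name the same elements is to parse both and compare the two-part names. The sketch below uses Python's standard-library xml.etree.ElementTree, which writes each name as the namespace URI in braces followed by the local name; that notation belongs to this library and is not part of XML itself.

```python
import xml.etree.ElementTree as ET

# The prefixed and the default-namespace examples from the text above.
prefixed = ET.fromstring(
    '<foo:A xmlns:foo="http://www.foo.org/"><foo:B>abcd</foo:B></foo:A>'
)
default = ET.fromstring('<A xmlns="http://www.foo.org/"><B>abcd</B></A>')

# ElementTree reports each name as {namespace URI}local-name.
prefixed_names = [element.tag for element in prefixed.iter()]
default_names = [element.tag for element in default.iter()]

print(prefixed_names)
print(prefixed_names == default_names)
```

The prefix itself does not appear in the parsed names, which shows that only the namespace URI and the local name identify an element.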

Glossary

attribute

XML structural construct. A name-value pair within a tagged element that modifies certain features of the element. For XML, all values must be enclosed in quotation marks.

cascading style sheets (CSS)

Formatting descriptions that provide augmented control over presentation and layout of HTML and XML elements. CSS can be used for describing the formatting behavior of simply structured XML documents, but does not provide a display structure that deviates from the structure of the source data.

CDATA section

XML structural construct. CDATA sections can be used to quote text that contains tags or reserved characters and thus prevent them from being interpreted as markup. For this reason, the CDATA section is especially useful for escaping markup and script. The syntax for CDATA sections in XML is <![CDATA[ ... ]]>.
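
A short sketch in Python (standard-library xml.etree.ElementTree) shows the effect. The element name example and its content are made up for this illustration.

```python
import xml.etree.ElementTree as ET

# The CDATA section keeps "<b>" and "&" from being read as markup.
element = ET.fromstring("<example><![CDATA[Use <b> & </b> for bold]]></example>")

text = element.text            # the quoted text, delivered as character data
child_count = len(element)     # no <b> child element was created

print(text)
print(child_count)
```

Without the CDATA section, the same content would have to be written with entity references such as &lt; and &amp;.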

character data

XML structural construct. The text content of an element or attribute. XML differentiates this plain text from markup.

character set

A mapping of a set of characters to their numeric values. For example, Unicode is a character set intended to cover the characters of all the world's writing systems; it is used as a worldwide character-encoding standard.

component

An object that encapsulates both data and code, and provides a well-specified set of publicly available services.

data type

The type of content that an element contains: a number, a date, and so on. In XML, an author can specify an element's data type, for example, with a tokenized attribute type. Microsoft is working with the W3C to define a set of standard types that anyone can freely use.

document element

The top-level element of an XML document; only one top-level element is allowed. The document element is a child of the document root.

Document Object Model (DOM)

The standard maintained by the W3C that specifies how the content, structure, and appearance of Web documents can be updated programmatically with scripts or other programs. The proposed object model for XML matches the Document Object Model for HTML so that script writers can easily learn XML programming. The XML DOM will provide a simple means of reading and writing data to and from an XML tree structure.

document root

The top-level node of an XML document; its descendants branch out from it to form the XML tree for that document. The document root contains the document element and can also contain a set of processing instructions and comments.

document type declaration

XML structural construct. A production within an XML document that contains or points to markup declarations that provide a grammar for a class of documents. This grammar is known as a Document Type Definition. The document type declaration can point to an external subset (a special kind of external entity) containing markup declarations, or can contain the markup declarations directly in an internal subset, or both. The DTD for a document consists of both subsets taken together. The syntax of the document type declaration is <!DOCTYPE content >.

Document Type Definition (DTD)

The markup declarations that describe a grammar for a class of documents. The DTD is declared within the document type declaration production of the XML file. The markup declarations can be in an external subset (a special kind of external entity), in an internal subset directly within the XML file, or both. The DTD for a document consists of both subsets taken together.

Electronic Data Interchange (EDI)

An existing format used to exchange data and support transactions. EDI transactions can be conducted only between sites that have been specifically set up with compatible systems.

element

XML structural construct. An XML element consists of a start tag, an end tag, and the information between the tags, which is often referred to as the contents. Elements used in an XML file are described by a DTD or schema, either of which can provide a description of the structure of the data.

entity

XML structural construct. A character sequence or well-formed XML hierarchy associated with a name. The entity can be referred to by an entity reference to insert the entity's contents into the tree at that point. The function of an XML entity is similar to that of a macro definition. Entity declarations occur in the DTD.

entity reference

XML structural construct. Refers to the content of a named entity. The name is delimited by the ampersand and semicolon characters; for example, &bookname; and &lt;. It is used in much the same way as a macro.

Extensible Linking Language (XLL)

An XML vocabulary that provides links in XML similar to those in HTML but with more functionality. Linking could be multidirectional, and links could exist at the object level rather than just at a page level.

Extensible Markup Language (XML)

A subset of SGML that provides a uniform method for describing and exchanging structured data in an open, text-based format, and delivers this data by use of the standard HTTP protocol. At the time of this writing, XML 1.0 is a World Wide Web Consortium Recommendation, which means that it is in the final stage of the approval process.

Extensible Stylesheet Language (XSL)

A language used to transform XML-based data into HTML or other presentation formats, for display in a Web browser. Differs from cascading style sheets in that it can present information in an order different from that in which it was received. XSL will also be able to generate CSS along with HTML. XSL consists of two parts, a vocabulary for transformation and the XSL Formatting Objects.

ID

A special attribute type within the XML language. The ID attribute on the XML element provides a unique name, enabling links to that element using the IDREF attribute type. The value associated with the ID attribute must be unique within that XML document. IDs are currently declared with a DTD or schema.

markup

XML structural construct. Text in an XML document that does not represent character data: start tags, end tags, empty-element tags, entity references, character references, comments, CDATA section delimiters, DTDs, and processing instructions.

mixed content

XML structural construct. An element type has mixed content when elements of that type can contain character data, optionally interspersed with child elements. In this case, the types of the child elements can be constrained, but not their order or their number of occurrences.

namespace

A mechanism to resolve naming conflicts between elements in an XML document when each comes from a different vocabulary; it allows the commingling of like tag names from different namespaces. A namespace identifies an XML vocabulary defined within a URN. An attribute on an element, attribute, or entity reference associates a short name with the URN that defines the namespace; that short name is then used as a prefix to the element, attribute, or entity reference name to uniquely identify the namespace. Namespace references have scope. All child nodes beneath the node that specifies the namespace inherit that namespace. This allows nonqualified names to use the default namespace.

NDATA

The literal string "NDATA" is used as part of a notation declaration. See also notation.

notation

Usually refers to a data format, such as BMP. A notation identifies by name the format of unparsed entities, the format of elements that bear a notation attribute, or the application to which a processing instruction is addressed.

notation declaration

A notation declaration provides a name and an external identifier for a notation. The name is used in entity and attribute-list declarations and in attribute specifications. The external identifier is used for the notation, which can allow an XML processor or its client application to locate a helper application capable of processing data in the given notation.

processing instruction (PI)

XML structural construct. Instructions that are passed through to the application. The target is specified as part of the PI. The syntax for a PI is <?pi-name content?>.

Resource Description Framework (RDF)

An object model similar in function to an application programming interface (API), RDF can be used by developers to access the logical meaning of designated content in XML documents.

root element

Sometimes this term is used to refer to the document element but this is misleading, since the top-level element and the document root are not the same. Because of this ambiguity, use of the term "root element" is discouraged.

schema

A formal specification of element names that indicates which elements are allowed in an XML document, and in which combinations. A schema is functionally equivalent to a DTD, but is written in XML; a schema also provides for extended functionality such as data typing, inheritance, and presentation rules.

Standard Generalized Markup Language (SGML)

The international standard for defining descriptions of structure and content of electronic documents. XML is a subset of SGML designed to deliver SGML-type information over the Web.

target

The application to which a processing instruction is directed. The target names beginning with "XML" and "xml" are reserved. The target appears as the first token in the PI. For example, in the XML declaration <?xml version="1.0"?>, the target is "xml".

text markup

Inserting tags into the middle of an element's text flow, to mark certain parts of the element with additional meta-information.

tokenized attribute type

Each attribute has an attribute type. Seven attribute types are characterized as tokenized: ID, IDREF, IDREFS, ENTITY, ENTITIES, NMTOKEN, and NMTOKENS.

Uniform Resource Identifier (URI)

The generic set of all names and addresses that refer to resources, including URLs and URNs. Defined in Berners-Lee, T., R. Fielding, and L. Masinter, Uniform Resource Identifiers (URI): Generic Syntax and Semantics. 1997. See updates to the W3C document RFC1738. The Layman-Bray proposal for namespaces makes every element name subordinate to a URI, which would ensure that element names are always unambiguous.

Uniform Resource Locator (URL)

The set of URI schemes that have explicit instructions on how to access the resource on the Internet.

Uniform Resource Name (URN)

A Uniform Resource Name identifies a persistent Internet resource.

valid XML

XML that conforms to the vocabulary specified in a DTD or schema.

W3C

World Wide Web Consortium

well-formed XML

XML that meets the requirements listed in the W3C Recommendation for XML 1.0: It contains one or more elements; it has a single document element, with any other elements properly nested under it; each of the parsed entities referenced directly or indirectly within the document is well-formed. A well-formed XML document does not necessarily include a DTD.
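
A non-validating parser can serve as a quick well-formedness check. The sketch below uses Python's standard-library xml.etree.ElementTree; the helper name is_well_formed is our own and the sample strings are invented for this illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

nested = is_well_formed("<a><b>text</b></a>")        # properly nested
overlapping = is_well_formed("<a><b>text</a></b>")   # overlapping tags
two_roots = is_well_formed("<a/><b/>")               # two top-level elements

print(nested, overlapping, two_roots)
```

None of the three samples includes a DTD, so this check says nothing about validity.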

World Wide Web Consortium (W3C)

The international consortium founded in 1994 to develop standards for the Web. See

XLL

Extensible Linking Language

XML

Extensible Markup Language

XML declaration

The first line of an XML file can optionally contain the "xml" processing instruction, which is known as the XML declaration. The XML declaration can contain pseudoattributes to indicate the XML language version, the character set, and whether the document can be used as a standalone entity.

XML document

A data object that is well-formed, according to the XML recommendation, and that might (or might not) be valid. The XML document has a logical structure (composed of declarations, elements, comments, character references, and processing instructions) and a physical structure (composed of entities, starting with the root, or document entity).

XML parser

A generalized XML parser reads XML files and generates a hierarchically structured tree, then hands off data to viewers and other applications for processing. A validating XML parser also checks the XML syntax and reports errors.
