XML - Managing Data Exchange/XMLWebAudio

Previous Chapter Next Chapter
XForms OpenOffice.org & OpenDocument Format

XML Web VoiceEdit

Learning ObjectivesEdit

Upon completion of this chapter, you will

  • understand the potential of XML based web delivery of audio text files
  • understand the use of SSML (Synthetic Speech Markup Language), the XML subset of tags

The Problem:Edit

How can text files of any content a user chooses be transferred to the user in such a form that the end user can listen to them on a piece of hardware that has mobile internet capability? An example would be someone listening to a news report of any topic or from any region just by converting text files which already exist on the internet to a sythesized voice for playing on a mobile device in an automobile.

Existing Technology: There are several technologies available that work with voice data. Voice XML provides a framework for transferring voice data between entities. It is used for interactive voice triggered tools. This technology is used extensively for phone menus and automated help by companies with customer service or other areas of high call volume needs.

Internet radio exists and provides a user with music or other programming that is broadcast throughout the internet. This programming is not up to the choice of the end user other than to select the internet station to listen to.

Software exists that can convert any text file into an audio file. Text files can be converted to audio files using software provided in the Windows or Mac operating systems or very inexpensive stand-alone software (an example is TextAloud). TextAloud allows a user to modify the voice, the pace of reading and other features. Free versions of it are available online. These systems can modify the voice in multiple ways to the personal taste of a user. These systems do not make the files available over the internet for users to search and listen to.

The Potential:Edit

With the right combination of XML technology, mobile communication services and software/hardware that already exists the idea of internet radio could be opened to a much larger volume of content than currently exists. Most internet radio is in the form of music files and programed radio content. The choices of internet radio could be extended to include any existing text file which would include news reports, government documents, educational materials and many forms of official records. A business example would be a travelling salesman briefing himself on a client’s buying history by listening to a file in his car on the way to a sales call with the customer. Another example includes language conversion software that already exists that could enable a person in a distant county to listen and learn about technology that is being developed somewhere else.


The technology would require three areas to come together to make the process work. 1. The XML technology must include a set of agreed upon XML tags for transferring files between content generators/distrubutors and users. 2. The mobile communications services must be able to deliver the data in a usuable format to an end user system. 3. Hardware and software must be able to make use of the documents sent and play them for a user. Included in this is the further development of a voice processing browser.

The second and third requirements are outside the scope of this chapter on XML. However, work is being done on them. The W3C (World Wide Web Consortium) is currently working on the Mobile Web Initiative which would set some standards for software vendors, content providers, hardware (handset) manufacturers, browser developers and mobile service operators. One suggestion being considered is a maximum page weight of 10K (a typical magazine article fits within that range). The availability of advertising being embedded and what form it would take is under debate. Delivery protocol is expected to be http. The connection for a mobile device can be slow but the audio files do not have to stream. Current vendors involved include Nokia, Ericsson, HP, France Telecom and Opera.

The first requirement would include a set of XML tags that all text file content generators (such as news services, governments, educational institutions and official records generators) could use to generate files of their content. Thus their content could be accessed and stored in a searchable database and requested for downloading and playback at any time from anywhere that supports a mobile browser device.


The Existing Tag Set:Edit

There is an existing set of XML tags called SSML (Synthesized Speech Markup Language). This set enables control of enough aspects of speech generation that a personable voice can be generated and manipulated by a user. A Text-to-Speech system uses the tags to take a text file and generate audible text in a voice.

Document Structure, Text Processing and Pronunciation Elements and Attributes:

speak - Root Element xml:lang - Attribute

                Language (indicates the natural language of the file, such as “en-US”); this 
                is preferred to be indicated only on the voice element so as to eliminate 
                changes in a voice in the midst of a voice file.

xml:base - Attribute

                base URI Attribute (optional)


<speak version="1.0"

 ... the body ...


lexicon - Element

       for pronunciation, (an empty element)

meta - Element

       (an empty element); includes a string that contains some information about the  
       ensuing data; it can declare a content type of “http” in the case of a file that 
       doesn’t have generated header fields from the originating server.

metadata - Element

       can provide broader information about data as it accesses a metadata schema.

p - Element

       text structure, represents a paragraph. It can only contain the following elements:   
       audio, break, emphasis, mark, phoneme, prosody, say-as, sub, s, voice.

s - Element

       text structure, Element; represents a sentence. It can only contain the following   
       elements: audio, break, emphasis, mark, phoneme, prosody, say-as, sub, voice.

say-as - Element

       available attributes: interpret-as, format, and detail phoneme with interpret-as being 
       the only required one. The tag set may only contain text to be rendered by a voice    
       synthesizer. This tag helps a browser to know more about the manner in which the  
       enclosed text is to be voiced.

format - Attribute

               this attribute gives additional hints as to the rendering of voiced text.   detail - Attribute
               this attribute is for indicating the level of detail to be applied to voiced  
               text. An example would be a special form of emphasis such as the reading of 
               computer code in a block of text.

Phoneme - Element

       a pronunciation indicator for the text to speech engine. The engine does not render 
       the contents of the tag, thus the tag can be empty. The attributes for the tag provide 
       what the engine will use to help with language specific pronunciation factors.     
       However, any text between the tag set will be rendered on screen in  a visual browser  
       for hearing impaired users. This tag can only contain text, no elements.
       alphabet - attribute 
               for Phoneme, used to specify a particular version of an alphabet, optional
       ph - Attribute
               a required attribute for phoneme, used to specify the string to be pronounced.


<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"

 <phoneme alphabet="ipa" ph="təmei̥ɾou̥"> pomegranate </phoneme>


sub - Element

       an element used to specify within its “alias” attribute the pronounced version of some 
       written text that is between the tag set. Example:


Prosody and Style - prosody covers such things as tone, intonation, conversational pacing, pitch of voice, loudness, duration of sound, chunking (units of words, not necessarily sentences).

voice - Element

       indicates the type of voice to use, all the attributes are optional, however, not 
       indicating any attributes at all is considered an error. The “lang” attribute takes  
       precedence; all other attributes are equal.

lang - attribute

                for voice element, indicates the language for the voice.

gender - Attribute age - Attributte'' variant - Attributte name - Attribute


<voice gender="male">Show me a person without a goal</voice>

 <voice gender="male" variant="2">
 and I'll show you a stock clerk.
 <voice name="James">Show me a stock clerk with a goal and I'll show you someone who will change the world.</voice>

emphasis - Element

       contains text to be emphasized by the speech processor (with stress or intensity). It  
       has one attribute:

level - Attribute

                indicating the degree of emphasis.


Geniuses themselves don't talk about the gift of genius, they just talk about

<emphasis level="strong"> hard work and long hours. </emphasis>

The "emphasis" element can contain text and the following elements: audio - Element desc - Element

                if the content is not speech then the “desc” tag should be used to describe   
                the content. This description can be used in a text output for the hearing   

break - Element emphasis - Element mark - Element phoneme - Element prosody - Element say-as - Element sub - Element voice - Element

break - Element

       wherever the element is used between words it indicates a pause in the reading of the  
       text; attributes are: “strength” with values of: none (meaning no pause even if the 
       system would normally put one there), x-weak, weak, medium, strong, x-strong; “time” 
       with values of either milliseconds: 250ms or seconds: 2s. 

prosody - Element

       controls the pitch, speaking rate and volume of a generated voice. Attributes   
       are optional but it is considered an error if no attributes are set. 
       pitch - Attribute
       contour - Attribute
       range - Attribute
       rate - Attribute
       duration - Attribute
       volume - Attribute

Other elements that allow the insertion of audio files in addition to generated voice content.

audio - Element

       may be empty but if it contains anything it should be the text that the speech 
       generator could convert to a voice in place of the audio file. 


 <audio src="JCPennyQuote.au">Every business is built on friendship.</audio>

mark - Element

       an empty tag that places a named marker into the content. When the processor  
       reaches a “mark” element one of two things happens. One, the processor is provided 
       with the info to retrieve the desired position in the content, two, an event is issued 
       that includes the content at the desired position. It has one attribute which is:
       name - Attribute

desc - Element

Potential Future of XML Web Audio:Edit

Additional tags could be introduced to contain dates, titles of files, authors, originating language and other metadata about the files. Expanding the set of existing tags would enable the files to be stored and searched in databases using multiple methods. They would enable storing of data related to the actual text/audio files that would be valuable to potential users. A user could search based on originating date of the file, the originating country of the file and subject or title of files.


Using SSML, a subset of XML, audio files can be generated from any text file such as news reports, government documents, educational materials or official records. This content could be delivered via mobile communication services and over the web. The files could be played on mobile browser devices. This could constitute a much larger market for internet radio than the strictly music or programmed content form it exists in today. This could generate many uses for on-demand access to many sources of information for travelling users.

Last modified on 2 September 2012, at 20:45