Upon completion of this chapter, you will
- understand the potential of XML-based web delivery of audio text files
- understand the use of SSML (Speech Synthesis Markup Language), an XML-based set of tags for controlling synthesized speech
How can text files of any content a user chooses be transferred to the user in such a form that the end user can listen to them on a piece of hardware that has mobile internet capability? An example would be someone listening to a news report on any topic or from any region, simply by converting text files that already exist on the internet to a synthesized voice for playback on a mobile device in an automobile.
Existing Technology: Several technologies already work with voice data. VoiceXML provides a framework for exchanging voice data between entities and is used for interactive, voice-triggered tools. It is used extensively for phone menus and automated help by companies with customer service departments or other areas with high call volumes.
Internet radio exists and provides a user with music or other programming that is broadcast throughout the internet. This programming is not up to the choice of the end user other than to select the internet station to listen to.
Software exists that can convert any text file into an audio file. The conversion can be done with software provided in the Windows or Mac operating systems or with very inexpensive stand-alone software (an example is TextAloud). TextAloud allows a user to modify the voice, the pace of reading and other features to suit personal taste. Free versions of it are available online. These systems do not, however, make the files available over the internet for users to search and listen to.
With the right combination of XML technology, mobile communication services and software/hardware that already exists, the idea of internet radio could be opened to a much larger volume of content than currently exists. Most internet radio is in the form of music files and programmed radio content. The choices of internet radio could be extended to include any existing text file, which would include news reports, government documents, educational materials and many forms of official records. A business example would be a travelling salesman briefing himself on a client’s buying history by listening to a file in his car on the way to a sales call with the customer. As another example, language conversion software that already exists could enable a person in a distant country to listen to and learn about technology that is being developed somewhere else.
The technology would require three areas to come together to make the process work.
1. The XML technology must include a set of agreed-upon XML tags for transferring files between content generators/distributors and users.
2. The mobile communications services must be able to deliver the data in a usable format to an end user system.
3. Hardware and software must be able to make use of the documents sent and play them for a user. Included in this is the further development of a voice processing browser.
The second and third requirements are outside the scope of this chapter on XML. However, work is being done on them. The W3C (World Wide Web Consortium) is currently working on the Mobile Web Initiative, which would set some standards for software vendors, content providers, hardware (handset) manufacturers, browser developers and mobile service operators. One suggestion being considered is a maximum page weight of 10K (a typical magazine article fits within that range). Whether advertising may be embedded, and what form it would take, is under debate. The delivery protocol is expected to be HTTP. The connection for a mobile device can be slow, but the audio files do not have to stream. Current vendors involved include Nokia, Ericsson, HP, France Telecom and Opera.
The first requirement would include a set of XML tags that all text file content generators (such as news services, governments, educational institutions and official records generators) could use to generate files of their content. Thus their content could be accessed and stored in a searchable database and requested for downloading and playback at any time from anywhere that supports a mobile browser device.
The Existing Tag Set:
There is an existing set of XML tags called SSML (Speech Synthesis Markup Language). This set enables control of enough aspects of speech generation that a personable voice can be generated and manipulated by a user. A Text-to-Speech system uses the tags to take a text file and generate audible text in a voice.
Document Structure, Text Processing and Pronunciation Elements and Attributes:
speak - Root Element
xml:lang - Attribute
Language (indicates the natural language of the file, such as “en-US”); beyond the root element, changes of language are preferably indicated only on the voice element, so as to avoid changes of voice in the midst of a voice file.
xml:base - Attribute
base URI Attribute (optional)
<speak version="1.0"
       xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
       xml:lang="en-US">
  ... the body ...
</speak>
lexicon - Element
for pronunciation (an empty element); it references an external pronunciation lexicon document through its required “uri” attribute.
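As a rough sketch of how a lexicon might be referenced, the fragment below places one lexicon element directly under speak. The address http://www.example.com/lexicon.file is a made-up placeholder, not a real lexicon.

```xml
<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <!-- Hypothetical lexicon location; uri is the required attribute -->
  <lexicon uri="http://www.example.com/lexicon.file"/>
  <p>Words in this paragraph may be looked up in the referenced lexicon.</p>
</speak>
```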
meta - Element
(an empty element); includes a string that contains some information about the ensuing data. Through its “http-equiv” attribute it can supply HTTP header information, such as a content type, in the case of a file that doesn’t have generated header fields from the originating server.
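The two forms of the meta element can be sketched as follows, assuming the usual pairing of either a “name” or an “http-equiv” attribute with a “content” attribute. The URL is a placeholder chosen for illustration.

```xml
<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <!-- name/content pair pointing at further metadata (placeholder URL) -->
  <meta name="seeAlso" content="http://www.example.com/my-ssml-metadata.xml"/>
  <!-- http-equiv/content pair standing in for an HTTP header field -->
  <meta http-equiv="Cache-Control" content="no-cache"/>
  <p>The spoken content starts here.</p>
</speak>
```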
metadata - Element
can provide broader information about data as it accesses a metadata schema.
p - Element
text structure, represents a paragraph. It can only contain the following elements: audio, break, emphasis, mark, phoneme, prosody, say-as, sub, s, voice.
s - Element
text structure; represents a sentence. It can only contain the following elements: audio, break, emphasis, mark, phoneme, prosody, say-as, sub, voice.
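A minimal sketch of the paragraph and sentence structure, with invented sample text, might look like this:

```xml
<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <p>
    <s>This is the first sentence of the paragraph.</s>
    <s>Here is a second sentence.</s>
  </p>
</speak>
```

Marking sentences explicitly spares the speech processor from guessing where one sentence ends and the next begins.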
say-as - Element
available attributes: interpret-as, format, and detail, with interpret-as being the only required one. The tag set may only contain text to be rendered by a voice synthesizer. This tag helps a browser to know more about the manner in which the enclosed text is to be voiced.
format - Attribute
this attribute gives additional hints as to the rendering of voiced text.
detail - Attribute
this attribute is for indicating the level of detail to be applied to voiced text. An example would be a special form of emphasis such as the reading of computer code in a block of text.
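The say-as element could be used along the following lines. The values “date”, “mdy” and “characters” are assumptions here: SSML itself does not fix the allowed values of interpret-as and format, so which ones a given speech processor understands may vary.

```xml
<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <!-- Assumed values: "date" read in month-day-year order -->
  The report was filed on
  <say-as interpret-as="date" format="mdy">12/25/2005</say-as>.
  <!-- Assumed value: "characters" spells the text out one character at a time -->
  The reference code is
  <say-as interpret-as="characters">XK42</say-as>.
</speak>
```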
phoneme - Element
a pronunciation indicator for the text-to-speech engine. The engine does not render the contents of the tag, thus the tag can be empty. The attributes for the tag provide what the engine will use to help with language-specific pronunciation factors. However, any text between the tag set will be rendered on screen in a visual browser for hearing impaired users. This tag can only contain text, no elements.
alphabet - Attribute
an optional attribute for phoneme, used to specify the phonetic alphabet in which the pronunciation is written (such as “ipa”).
ph - Attribute
a required attribute for phoneme, used to specify the string to be pronounced.
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
       xml:lang="en-US">
  <phoneme alphabet="ipa" ph="təmei̥ɾou̥"> tomato </phoneme>
</speak>
sub - Element
an element whose “alias” attribute specifies the pronounced version of the written text that is between the tag set.
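As an illustration of sub, the sketch below has the written abbreviation “W3C” spoken as its full name. The surrounding sentence is invented for the example.

```xml
<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <!-- The text "W3C" is written; the alias is what gets spoken -->
  The standard is published by the
  <sub alias="World Wide Web Consortium">W3C</sub>.
</speak>
```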
Prosody and Style - prosody covers such things as tone, intonation, conversational pacing, pitch of voice, loudness, duration of sound, chunking (units of words, not necessarily sentences).
voice - Element
indicates the type of voice to use. All the attributes are optional; however, not indicating any attributes at all is considered an error. The “xml:lang” attribute takes precedence; all other attributes are equal.
xml:lang - Attribute
for voice element, indicates the language for the voice.
gender - Attribute
age - Attribute
variant - Attribute
name - Attribute
<voice gender="male">Show me a person without a goal</voice>
<voice gender="male" variant="2"> and I'll show you a stock clerk. </voice>
<voice name="James">Show me a stock clerk with a goal and I'll show you someone who will change the world.</voice>
emphasis - Element
contains text to be emphasized by the speech processor (with stress or intensity). It has one attribute:
level - Attribute
indicating the degree of emphasis.
Geniuses themselves don't talk about the gift of genius, they just talk about
<emphasis level="strong"> hard work and long hours. </emphasis>
The "emphasis" element can contain text and the following elements: audio, break, emphasis, mark, phoneme, prosody, say-as, sub, voice.
Within an audio element, if the content is not speech then the “desc” tag should be used to describe the content. This description can be used in a text output for the hearing impaired.
break - Element
wherever the element is used between words it indicates a pause in the reading of the text; attributes are: “strength” with values of: none (meaning no pause even if the system would normally put one there), x-weak, weak, medium, strong, x-strong; “time” with values of either milliseconds: 250ms or seconds: 2s.
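The two break attributes might be combined as in this sketch, which uses invented news text. The first pause is given as a length of time, the others as a strength.

```xml
<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  Here are today's headlines. <break time="500ms"/>
  The first story comes from the capital. <break strength="strong"/>
  <!-- strength="none" suppresses a pause the processor might otherwise add -->
  The second story follows, <break strength="none"/> without any pause.
</speak>
```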
prosody - Element
controls the pitch, speaking rate and volume of a generated voice. Attributes are optional but it is considered an error if no attributes are set.
pitch - Attribute
contour - Attribute
range - Attribute
rate - Attribute
duration - Attribute
volume - Attribute
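A brief sketch of prosody in use follows. It relies only on the named values for rate, pitch and volume (such as “slow”, “high” and “soft”); the sample sentences are invented.

```xml
<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <prosody rate="slow" volume="loud">This warning is read slowly and loudly.</prosody>
  <prosody pitch="high">This sentence is read at a higher pitch.</prosody>
  <!-- Several attributes can be set on the same element -->
  <prosody rate="fast" pitch="low" volume="soft">The small print is read quickly and quietly.</prosody>
</speak>
```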
Other elements that allow the insertion of audio files in addition to generated voice content.
audio - Element
may be empty but if it contains anything it should be the text that the speech generator could convert to a voice in place of the audio file.
<audio src="JCPennyQuote.au">Every business is built on friendship.</audio>
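For non-speech audio, a desc element can be placed inside audio, roughly as sketched here. The file name applause.au is hypothetical; the plain text after desc is what a speech generator could voice if the file cannot be played.

```xml
<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <audio src="applause.au">
    <!-- Description of the sound, usable in text output -->
    <desc>Audience applause</desc>
    The audience applauds.
  </audio>
</speak>
```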
mark - Element
an empty tag that places a named marker into the content. When the processor reaches a “mark” element, one or both of two things happen. One, the hosting environment is given the name of the mark together with the information needed to retrieve the corresponding position in the content; two, an event is issued that includes the name of the mark once the audio output reaches that position. It has one attribute, which is required:
name - Attribute
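Markers could be placed in a news item as in the following sketch; the marker names “headline” and “details” are arbitrary choices for illustration.

```xml
<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <mark name="headline"/> Markets closed higher today.
  <mark name="details"/> Technology shares led the gains.
</speak>
```

A player could use such markers to let a listener jump back to the start of a story.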
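desc - Element
can occur only inside an audio element; its optional xml:lang attribute gives the language of the description.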
Potential Future of XML Web Audio:
Additional tags could be introduced to contain dates, titles of files, authors, originating language and other metadata about the files. Expanding the set of existing tags would enable the files to be stored and searched in databases using multiple methods. They would enable storing of data related to the actual text/audio files that would be valuable to potential users. A user could search based on originating date of the file, the originating country of the file and subject or title of files.
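One possible starting point, offered only as a sketch, is the existing metadata element, which can hold descriptions written in another XML vocabulary. The example assumes RDF with Dublin Core properties (title, creator, date, language, subject); the file address, titles and names are all invented.

```xml
<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <metadata>
    <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
             xmlns:dc="http://purl.org/dc/elements/1.1/">
      <!-- rdf:about names the (hypothetical) file being described -->
      <rdf:Description rdf:about="http://www.example.com/reports/harvest.ssml">
        <dc:title>Regional Harvest Report</dc:title>
        <dc:creator>Example News Service</dc:creator>
        <dc:date>2006-05-01</dc:date>
        <dc:language>en-US</dc:language>
        <dc:subject>agriculture</dc:subject>
      </rdf:Description>
    </rdf:RDF>
  </metadata>
  <p><s>The regional harvest report follows.</s></p>
</speak>
```

A database could index these fields so that files are searchable by date, origin, subject or title without the speech content itself being parsed.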
Using SSML, an XML-based markup language, audio files can be generated from any text file such as news reports, government documents, educational materials or official records. This content could be delivered via mobile communication services and over the web, and the files could be played on mobile browser devices. This could constitute a much larger market for internet radio than the strictly music or programmed content form it exists in today, and could open up on-demand access to many sources of information for travelling users.