VoiceXML examples

According to the W3C, "VoiceXML is designed for creating audio dialogs that feature synthesized speech, digitized audio, recognition of spoken and DTMF key input, recording of spoken input, telephony, and mixed initiative conversations. Its major goal is to bring the advantages of Web-based development and content delivery to interactive voice response applications."

Here are two short examples of VoiceXML. The first is the always fun example, "Hello World":

Hello world

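<?xml version="1.0" encoding="UTF-8"?>
<vxml xmlns="http://www.w3.org/2001/vxml" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.w3.org/2001/vxml http://www.w3.org/TR/voicexml20/vxml.xsd" version="2.0">
  <form>
    <block>Hello World!</block>
  </form>
</vxml>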

The top-level element is <vxml>, which is mainly a container for dialogs. The two main types of dialogs are forms and menus. Forms present information and gather input. Menus offer choices of what to do next. This example has a single form, which contains a block that synthesizes and presents "Hello World!" to the user. Since the form does not specify a dialog after "Hello World", the conversation ends.

Our second example asks the user for a choice of drink and then submits it to a server script:

Form example:

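<?xml version="1.0" encoding="UTF-8"?>
<vxml xmlns="http://www.w3.org/2001/vxml" version="2.0">
  <form>
    <field name="drink">
      <prompt>Would you like coffee, tea, milk, or nothing?</prompt>
      <!-- The GSL body sits in a CDATA section so its slot markers are not parsed as XML tags -->
      <grammar type="application/x-gsl" mode="voice">
        <![CDATA[ [
          [coffee] {<drink "Coffee">}
          [tea] {<drink "Tea">}
          [milk] {<drink "Milk">}
          [nothing] {<drink "Nothing">}
        ] ]]>
      </grammar>
    </field>
    <block>
      <submit next="drink2.asp"/>
    </block>
  </form>
</vxml>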

A field is an input field. The user must provide a value for the field before the next element in the form is referenced or executed. Here is an example of a simple interaction:

  • C (computer): Would you like coffee, tea, milk, or nothing?
  • H (human): Orange juice.
  • C: I did not understand what you said. (a platform-specific default message.)
  • C: Would you like coffee, tea, milk, or nothing?
  • H: Tea
  • C: (continues in document drink2.asp)

Menu example:

<?xml version="1.0" encoding="UTF-8"?>
<vxml xmlns="http://www.w3.org/2001/vxml" version="2.0">
  <menu>
    <property name="inputmodes" value="dtmf"/>
    <prompt>For sports press 1, For weather press 2, For Stargazer astrophysics press 3.</prompt>
    <choice dtmf="1" next=""/>
    <choice dtmf="2" next=""/>
    <choice dtmf="3" next=""/>
  </menu>
</vxml>

The computer, or receiver, recognizes the number and sends a message to trigger the next dialog, according to which number was chosen. Here is what a typical conversation would look like:

  • C: For Sports press 1, For weather press 2, For Stargazer astrophysics press 3.
  • H: 4
  • C: I did not understand what value you typed. (a platform-specific default message.)
  • C: For Sports press 1, For weather press 2, For Stargazer astrophysics press 3.
  • H: 1 “sports”.
  • C: (proceeds to the document named by the choice for key 1.)

The beginning of VoiceXML

VoiceXML began in 1995 as an XML-based dialog design language. It was mainly used to simplify speech recognition applications in an AT&T project called Phone Markup Language (PML). After the creation of this language, other companies worked on their own PML-like languages, including Lucent, Motorola (VoxML), IBM (SpeechML), HP (TalkML), and PipeBeach (VoiceHTML). Starting in 1998, AT&T, IBM, Lucent, and Motorola organized the VoiceXML Forum to define a standard dialog design language that developers could use to build conversational applications. They chose XML as the basis for this effort because it was clear to them that this was the direction technology was going. By 2000, the VoiceXML Forum had released VoiceXML 1.0 to the public and submitted it to the W3C as the basis for an international standard. This submission led to VoiceXML 2.0, which is based on input from W3C member companies, W3C working groups, and developers.


VoiceXML is designed for creating audio dialogs that feature synthesized speech, digitized audio, recognition of spoken and DTMF key input, recording of spoken input, telephony, and mixed initiative conversations. DTMF (Dual Tone Multi-Frequency) is touch-tone or push-button dialing: pushing a button on a telephone keypad generates a sound that is a combination of two tones, one high frequency and the other low frequency. In layman's terms, VoiceXML allows the use of computer speech, recorded audio, human speech, and telephones as input and output devices.

VoiceXML architectural model

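The architectural model assumed by this document has the following components: a document server, a VoiceXML interpreter, a VoiceXML interpreter context, and an implementation platform.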


A document server (e.g. a Web server) processes requests from a client application, the VoiceXML Interpreter, through the VoiceXML interpreter context. The server produces VoiceXML documents in reply, which are processed by the VoiceXML interpreter. The VoiceXML interpreter context may monitor user inputs in parallel with the VoiceXML interpreter. For example, one VoiceXML interpreter context may always listen for a special escape phrase that takes the user to a high-level personal assistant, and another may listen for escape phrases that alter user preferences like volume or text-to-speech characteristics.

The implementation platform is controlled by the VoiceXML interpreter context and by the VoiceXML interpreter. For instance, in an interactive voice response application, the VoiceXML interpreter context may be responsible for detecting an incoming call, acquiring the initial VoiceXML document, and answering the call, while the VoiceXML interpreter conducts the dialog after answer. The implementation platform generates events in response to user actions (e.g. spoken or character input received, disconnect) and system events (e.g. timer expiration). Some of these events are acted upon by the VoiceXML interpreter itself, as specified by the VoiceXML document, while others are acted upon by the VoiceXML interpreter context.

The Goals of VoiceXML

VoiceXML's main goal is to bring the full power of Web development and content delivery to voice response applications, and to free the authors of such applications from low-level programming and resource management. VoiceXML enables the integration of voice services with data services using the client-server paradigm. A voice service can be defined as a sequence of interactive dialogs between a user and an implementation platform. The dialogs are provided by document servers, which may be external to the implementation platform. These servers maintain overall service logic, perform database and legacy system operations, and produce dialogs. A VoiceXML document specifies each dialog to be conducted by a VoiceXML interpreter. User input is collected into requests submitted to the document server, and the document server replies with another VoiceXML document to continue the user's session with other dialogs.

VoiceXML is a markup language that:

  • Minimizes client/server interactions by specifying multiple interactions per document.
  • Shields application authors from low-level and platform-specific details.
  • Separates user interaction code (in VoiceXML) from service logic (e.g. CGI scripts).
  • Allows multiplatform development, becoming a common language for content providers, tool providers, and platform providers.
  • Offers ease of use for simple interactions, and yet provides language features to support complex dialogs.

While VoiceXML strives to accommodate the requirements of a majority of voice response services, services with stringent requirements may best be served by dedicated applications that employ a finer level of control.

Principles of Design

VoiceXML is an XML application [XML]. These are some of the capabilities VoiceXML provides:

  • The language promotes portability of services through abstraction of platform resources.
  • The language accommodates platform diversity in supported audio file formats, speech grammar formats, and URI schemes.
  • The language makes it easy to create common types of interactions.
  • The language has well-defined semantics and syntax that preserve the author's intent regarding the behavior of interactions with the user.
  • The language recognizes semantic interpretations from all types of grammars and makes this information available to the application.
  • The language has a control flow mechanism.
  • The language enables a separation of service logic from interaction behavior.
  • It is not intended for intensive computation, database operations, or legacy system operations. These are assumed to be handled by resources outside the document interpreter, e.g. a document server.
  • General service logic, state management, dialog generation, and dialog sequencing are assumed to reside outside the document interpreter.
  • The language provides ways to link documents and submit data to server scripts using URIs.
  • VoiceXML provides ways to identify exactly which data to submit to the server, and which HTTP method (GET or POST) to use in the submittal.
  • The language does not require document authors to explicitly allocate and deallocate dialog resources.
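
The points above about linking documents and choosing which data to submit, and by which HTTP method, can be sketched as follows. This is only a minimal illustration: the form id, the field names, and the relative URI order.cgi are hypothetical, and the built-in field types digits and boolean are assumed to be available on the platform.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml xmlns="http://www.w3.org/2001/vxml" version="2.0">
  <form id="order">
    <field name="quantity" type="digits">
      <prompt>How many would you like?</prompt>
    </field>
    <field name="gift_wrap" type="boolean">
      <prompt>Do you want gift wrapping?</prompt>
    </field>
    <block>
      <!-- namelist selects exactly which variables are sent;
           method selects HTTP GET or POST.
           "order.cgi" is a placeholder for a server script. -->
      <submit next="order.cgi" namelist="quantity gift_wrap" method="post"/>
    </block>
  </form>
</vxml>
```

The document only collects the two values; any computation or database work on them is left to the server script, in line with the separation of service logic from interaction behavior.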

Implementation Platform Requirements

This section outlines the hardware/software requirements to support a VoiceXML interpreter:

Document acquisition: The interpreter context is expected to acquire documents for the VoiceXML interpreter to act on, and it must support the "http" URI scheme. In some cases the document request is generated by the interpretation of a VoiceXML document, while other requests are generated in response to events outside the scope of the language, such as an incoming phone call. When issuing document requests via HTTP, the interpreter context identifies itself using the "User-Agent" header with the value "<name>/<version>", for example, "acme-browser/1.2".

Audio output: An implementation platform must support audio output using audio files and text-to-speech (TTS). Audio files are referenced by URI. The platform must be able to freely sequence TTS and audio output. If an audio output resource is not available, an error.noresource event must be thrown.
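
As a rough sketch of how a document mixes the two kinds of audio output, the fragment below (meant to sit inside a form item or block) sequences a recorded file with synthesized speech. The file name welcome.wav is a hypothetical placeholder.

```xml
<prompt>
  <!-- "welcome.wav" is a hypothetical audio file referenced by URI.
       The text inside <audio> is spoken by TTS if the file cannot be played. -->
  <audio src="welcome.wav">Welcome to the service.</audio>
  <!-- Plain text in a prompt is rendered with text-to-speech. -->
  Please listen carefully to the following options.
</prompt>
```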

Audio input: An implementation platform must detect and report character and/or spoken input simultaneously. It must also control the input detection interval with a timer whose length is specified by a VoiceXML document.

  • Platforms must support the XML form of DTMF grammars described in the W3C Speech Recognition Grammar Specification (SRGS).
  • Platforms must be able to receive speech recognition grammar data dynamically.
  • Platforms may support other grammar formats, such as the JSpeech Grammar Format or proprietary formats.
  • Platforms must be able to record audio received from the user.
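
The input timer mentioned under audio input is set from the document itself. A minimal sketch, assuming a hypothetical field named pin and the built-in digits type, could look like this fragment of a form:

```xml
<field name="pin" type="digits">
  <!-- The timeout property sets how long the platform waits for input
       before throwing a noinput event. The value 5s is an arbitrary choice. -->
  <property name="timeout" value="5s"/>
  <prompt>Please enter your PIN.</prompt>
  <noinput>I did not hear anything. <reprompt/></noinput>
</field>
```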

Transfer: The platform should be able to support making a third party connection through a communications network, such as the telephone.
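
A third-party connection is requested with the <transfer> element. The sketch below is an illustration only: the form id, the variable name support_call, and the telephone number are made up, and a bridge transfer is assumed so that the dialog resumes after the call.

```xml
<form id="call_support">
  <!-- The destination number is fictitious. -->
  <transfer name="support_call" dest="tel:+15555550100"
            bridge="true" connecttimeout="20s">
    <prompt>Please wait while we connect you.</prompt>
    <filled>
      <!-- After a bridge transfer, the form item variable holds the outcome. -->
      <if cond="support_call == 'busy'">
        <prompt>The line is busy. Please try again later.</prompt>
      <elseif cond="support_call == 'noanswer'"/>
        <prompt>Nobody answered. Please try again later.</prompt>
      </if>
    </filled>
  </transfer>
</form>
```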


A VoiceXML document is a conversational finite state machine, in which the user is always in one conversational state, or dialog, at a time. Each dialog determines the next dialog to transition to. Transitions can be defined using URIs, which define the next document and dialog to use. When there are no more dialogs, or there is an element that explicitly exits the conversation, the execution is terminated. A VoiceXML document is primarily composed of top-level elements called dialogs.

There are two types of dialogs: forms and menus. A document may also have:

  • <meta> elements.
  • <metadata> elements.
  • <var> elements.
  • <script> elements.
  • <property> elements.
  • <catch> elements.
  • <link> elements.

Forms define an interaction that collects values for a set of form item variables. Each field may specify a grammar that defines the allowable inputs for that field.

A menu presents the user with a choice of options and then transitions to another dialog based on the selected choice. Each dialog has one or more speech and/or DTMF grammars associated with it, which are active only when the user is in that dialog.

A subdialog is like a function call because it provides a mechanism for invoking a new interaction and then returning to the original dialog. Variable instances, grammars, and state information are saved and are available upon returning to the calling document. Subdialogs can be used to create a confirmation sequence that may require a database query, to create a set of components that may be shared among documents in a single application, or to create a reusable library of dialogs shared among many applications.
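
The function-call analogy can be sketched with a confirmation sequence. In this hypothetical document the form ids and variable names are invented for illustration; the calling form invokes the confirm form as a subdialog and reads the returned value from the subdialog's result object.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml xmlns="http://www.w3.org/2001/vxml" version="2.0">
  <form id="checkout">
    <!-- Invoke the "confirm" dialog; its returned variables become
         properties of the form item variable "confirmation". -->
    <subdialog name="confirmation" src="#confirm">
      <filled>
        <if cond="confirmation.answer">
          <prompt>Your order has been placed.</prompt>
        <else/>
          <prompt>Your order has been cancelled.</prompt>
        </if>
      </filled>
    </subdialog>
  </form>

  <form id="confirm">
    <field name="answer" type="boolean">
      <prompt>Shall I place the order?</prompt>
      <filled>
        <!-- Hand the collected value back to the calling dialog. -->
        <return namelist="answer"/>
      </filled>
    </field>
  </form>
</vxml>
```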

A session begins when the user starts to interact with a VoiceXML interpreter context, continues as documents are loaded and processed, and ends when requested by the user, a document, or the interpreter context.

An application is a set of documents sharing the same application root document. Whenever the user interacts with a document in an application, its application root document is also loaded. The application root document remains loaded while the user is transitioning between other documents in the same application, and it is unloaded when the user transitions to a document that is not in the application.
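
A document joins an application by naming its root document in the application attribute. The following sketch assumes a hypothetical root document app-root.vxml that declares a variable called company_name; neither name comes from the original text.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Leaf document. "app-root.vxml" is a hypothetical application root
     document that would hold shared variables, links, and handlers. -->
<vxml xmlns="http://www.w3.org/2001/vxml" version="2.0"
      application="app-root.vxml">
  <form id="main">
    <block>
      <!-- Variables declared with <var> in the root document are visible
           here in the application scope. -->
      <prompt>Welcome to <value expr="application.company_name"/>.</prompt>
    </block>
  </form>
</vxml>
```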

Grammars: Each dialog has one or more speech and/or DTMF grammars associated with it. In machine directed applications, each dialog's grammars are active only when the user is in that dialog. In mixed initiative applications, where the user and the machine alternate in determining what to do next, some of the dialogs are flagged to make their grammars active (i.e., listened for) even when the user is in another dialog in the same document, or on another loaded document in the same application. In this situation, if the user says something matching another dialog's active grammars, execution transitions to that other dialog, with the user's utterance treated as if it were said in that dialog. Mixed initiative adds flexibility and power to voice applications.

Events: VoiceXML provides a form-filling mechanism for handling normal user input, and it also defines a mechanism for handling events not covered by the form mechanism. Events are thrown when the user does not respond, does not respond intelligibly, or requests assistance. The VoiceXML interpreter also throws events if it finds a semantic error in a VoiceXML document. Events are handled by catch elements or their shorthand forms, such as <noinput>, <nomatch>, <help>, and <error>.

A link specifies a grammar that is active whenever the user is in the scope of the link. If user input matches the link's grammar, control transfers to the link's destination URI. A link can be used to throw an event or go to a destination URI.
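
As a hedged illustration of event handling, the field below catches the three most common user events and one error event. The field name and the grammar file cities.grxml are hypothetical.

```xml
<field name="city">
  <prompt>Which city do you want the weather for?</prompt>
  <!-- "cities.grxml" is a placeholder for an external SRGS grammar. -->
  <grammar src="cities.grxml" type="application/srgs+xml"/>
  <!-- Shorthand handlers for the noinput, nomatch, and help events. -->
  <noinput>Sorry, I did not hear you. <reprompt/></noinput>
  <nomatch>Sorry, I did not understand. <reprompt/></nomatch>
  <help>Say the name of a city, for example Atlanta.</help>
  <!-- The general form: catch any event by name. -->
  <catch event="error.badfetch">
    <prompt>The service is unavailable. Goodbye.</prompt>
    <exit/>
  </catch>
</field>
```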
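
A document-level link might be sketched as below. The dialog ids and the phrases are invented for illustration; the inline grammars use the XML form of SRGS.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml xmlns="http://www.w3.org/2001/vxml" version="2.0">
  <!-- Declared at document level, so its grammars are active
       in every dialog of this document. -->
  <link next="#main_menu">
    <grammar mode="voice" xml:lang="en-US" version="1.0" root="back">
      <rule id="back" scope="public">main menu</rule>
    </grammar>
    <grammar mode="dtmf" version="1.0" root="zero">
      <rule id="zero" scope="public">0</rule>
    </grammar>
  </link>

  <menu id="main_menu">
    <prompt>Say news or weather.</prompt>
    <choice next="#news">news</choice>
    <choice next="#weather">weather</choice>
  </menu>

  <form id="news">
    <field name="news_again" type="boolean">
      <prompt>Here is the news. Would you like to hear it again?</prompt>
    </field>
  </form>

  <form id="weather">
    <field name="weather_again" type="boolean">
      <prompt>Here is the weather. Would you like to hear it again?</prompt>
    </field>
  </form>
</vxml>
```

While either form is waiting for input, saying "main menu" or pressing 0 matches the link's grammar and returns the caller to the menu.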

VoiceXML elements

For more information about the elements, see the W3C VoiceXML specification.

Element Purpose
<assign> Assign a variable a value
<audio> Play an audio clip within a prompt
<block> A container of (non-interactive) executable code
<catch> Catch an event
<choice> Define a menu item
<clear> Clear one or more form item variables
<disconnect> Disconnect a session
<else> Used in <if> elements
<elseif> Used in <if> elements
<enumerate> Shorthand for enumerating the choices in a menu
<error> Catch an error event
<exit> Exit a session
<field> Declares an input field in a form
<filled> An action executed when fields are filled
<form> A dialog for presenting information and collecting data
<goto> Go to another dialog in the same or different document
<grammar> Specify a speech recognition or DTMF grammar
<help> Catch a help event
<if> Simple conditional logic
<initial> Declares initial logic upon entry into a (mixed initiative) form
<link> Specify a transition common to all dialogs in the link’s scope
<log> Generate a debug message
<menu> A dialog for choosing amongst alternative destinations
<meta> Define a metadata item as a name/value pair
<metadata> Define metadata information using a metadata schema
<noinput> Catch a noinput event
<nomatch> Catch a nomatch event
<object> Interact with a custom extension
<option> Specify an option in a <field>
<param> Parameter in <object> or <subdialog>
<prompt> Queue speech synthesis and audio output to the user
<property> Control implementation platform settings.
<record> Record an audio sample
<reprompt> Play a field prompt when a field is re-visited after an event
<return> Return from a subdialog.
<script> Specify a block of ECMAScript client-side scripting logic
<subdialog> Invoke another dialog as a subdialog of the current one
<submit> Submit values to a document server
<throw> Throw an event
<transfer> Transfer the caller to another destination
<value> Insert the value of an expression in a prompt
<var> Declare a variable
<vxml> Top-level element in each VoiceXML document

One Document Execution

Document execution starts with the first dialog by default. As each dialog executes, the next dialog is determined. When a dialog doesn't reference another dialog, document execution stops.

Here is the "Hello World!" example expanded to illustrate VoiceXML execution. It now has a document level variable called "hi" which holds the greeting. Its value is used as the prompt in the first form. Once the first form plays the greeting, it goes to the form named "say_goodbye", which prompts the user with "Goodbye!" Because the second form does not have a transition to another dialog, the document execution ceases.

<?xml version="1.0" encoding="UTF-8"?>
<vxml xmlns="http://www.w3.org/2001/vxml" version="2.0">
 <meta name="author" content="John Doe"/>
 <meta name="maintainer" content=""/>
 <var name="hi" expr="'Hello World!'"/>
 <form>
   <block>
     <value expr="hi"/>
     <goto next="#say_goodbye"/>
   </block>
 </form>
 <form id="say_goodbye">
   <block>Goodbye!</block>
 </form>
</vxml>

Variables and Expressions

VoiceXML variables are in all respects equivalent to ECMAScript variables: they are part of the same variable space. VoiceXML variables can be used in a <script> just as variables defined in a <script> can be used in VoiceXML. Declaring a variable using <var> is equivalent to using a var statement in a <script> element. <script> can also appear everywhere that <var> can appear. VoiceXML variables are also declared by form items.

The variable naming convention is as in ECMAScript, but names beginning with the underscore character ("_") and names ending with a dollar sign ("$") are reserved for internal use. VoiceXML variables, including form item variables, must not contain ECMAScript reserved words. They must also follow ECMAScript rules for referential correctness. For example, variable names must be unique and their declaration must not include a dot - "var x.y" is an illegal declaration in ECMAScript. Variable names which violate naming conventions or ECMAScript rules cause an 'error.semantic' event to be thrown.

Variables are declared using the <var> element:

<var name="room_number"/> 
<var name="avg_mult" expr="2.2"/> 
<var name="state" expr="'Georgia'"/>
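
The shared variable space can be illustrated with a short sketch in which a <script> function and a <var> declaration refer to each other. The names base_rate, quote, and total are hypothetical.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml xmlns="http://www.w3.org/2001/vxml" version="2.0">
  <var name="avg_mult" expr="2.2"/>
  <script>
    <![CDATA[
      // Declared in a script, but visible to VoiceXML expressions as well.
      var base_rate = 10;
      function quote(mult) { return base_rate * mult; }
    ]]>
  </script>
  <form>
    <block>
      <!-- A VoiceXML variable initialized by calling the script function. -->
      <var name="total" expr="quote(avg_mult)"/>
      <!-- <assign> changes a variable that has already been declared. -->
      <assign name="total" expr="Math.round(total)"/>
      <prompt>Your total is <value expr="total"/>.</prompt>
    </block>
  </form>
</vxml>
```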

<vxml> Element

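<?xml version="1.0" encoding="UTF-8"?>
<vxml xmlns="http://www.w3.org/2001/vxml" version="2.0">
  <!-- dialogs and other top-level elements go here -->
</vxml>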


Attributes of <vxml> include:

version The version of VoiceXML of this document (required). The current version number is 2.0.
base Defines a base URI, to be used when resolving relative URIs in the document.
xmlns The designated namespace for VoiceXML (required). The namespace for VoiceXML is defined to be http://www.w3.org/2001/vxml
xml:base The base URI for this document as defined in the xml-base. It is a URI which all relative references within the document take as their base.
xml:lang The language identifier for this document. If omitted, the value is a platform-specific default.
xmlns:xsi Used along with xsi:schemaLocation to indicate the location of the schema for the VoiceXML namespace.
xsi:schemaLocation Used along with xmlns:xsi to indicate the location of the schema for the VoiceXML namespace.
application The URI of this document’s application root document, if any.

<field> Element

A field specifies an input item to be gathered from the user. Some attributes of this element are:

name The form item variable in the dialog scope that will hold the result. The name must be unique among form items in the form.
expr The initial value of the form item variable; default is ECMAScript undefined.
cond An expression that must evaluate to true after conversion to boolean in order for the form item to be visited. The form item can also be visited if the attribute is not specified.
type The type of field, i.e., the name of a built-in grammar type.
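
A short sketch of these attributes in use: the second field below is visited only when its cond expression is true. The form id and field names are hypothetical, and the built-in types boolean and digits are assumed to be supported by the platform.

```xml
<form id="survey">
  <field name="has_account" type="boolean">
    <prompt>Do you already have an account?</prompt>
  </field>
  <!-- Visited only if the caller answered yes to the previous field. -->
  <field name="account_number" type="digits" cond="has_account">
    <prompt>Please say or key in your account number.</prompt>
  </field>
  <block>
    <prompt>Thank you.</prompt>
  </block>
</form>
```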

<grammar> Element

The <grammar> element is used to provide a speech grammar that

  • specifies the expressions that a user may use to perform an action or supply information
  • returns a corresponding semantic interpretation such as simple values (strings), attribute-value pairs (day, month, and year), or nested objects.

Some attributes of the <grammar> element are:

version Defines the version of the grammar.
xml:lang The identifier for the language of that specific grammar ("fr-CA" for Canadian French.)
mode Defines the mode of the grammar following the modes of the W3C Speech Recognition Grammar Specification SRGS.
root Defines the root rule of the grammar.
tag-format Defines the tag content format for all tags within the grammar.
xml:base Declares the base URI from which relative URIs in the grammar are resolved. This base declaration has precedence over the <vxml> base URI declaration.
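
For comparison with the attribute list, here is a minimal sketch of an inline grammar in the XML form of SRGS, attached to a hypothetical drink field. The rule name and word list are chosen for illustration only.

```xml
<field name="drink">
  <prompt>Would you like coffee, tea, or milk?</prompt>
  <!-- mode, xml:lang, version, and root correspond to the attributes above. -->
  <grammar mode="voice" xml:lang="en-US" version="1.0" root="drink_rule">
    <rule id="drink_rule" scope="public">
      <one-of>
        <item>coffee</item>
        <item>tea</item>
        <item>milk</item>
      </one-of>
    </rule>
  </grammar>
</field>
```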

<block> Element

This element is a form item. It contains executable content that is executed if the block’s form item variable is undefined and the block's cond attribute, if any, evaluates to true.

<block>Welcome to Flamingo, your source for lawn ornaments.</block>

The form item variable is automatically set to true just before the block is entered. Therefore, a block is typically executed at most once each time the form is invoked. Sometimes you may need more control over blocks. To do this, you can name the form item variable, and set or clear it to control execution of the <block>. This variable is declared in the dialog scope of the form. Attributes of <block> include:

name The name of the form item variable used to track whether this block is eligible to be executed; defaults to an inaccessible internal variable.
expr The initial value of the form item variable; default is ECMAScript undefined.
cond An expression that must evaluate to true after conversion to boolean in order for the form item to be visited.
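
Naming the block's form item variable makes it possible to run the block again. In the sketch below, which uses hypothetical names throughout, clearing the variables intro and again makes both the block and the field eligible for another visit.

```xml
<form id="greeting">
  <var name="play_intro" expr="true"/>
  <!-- Runs only while "intro" is undefined and the cond is true. -->
  <block name="intro" cond="play_intro">
    Welcome to Flamingo, your source for lawn ornaments.
  </block>
  <field name="again" type="boolean">
    <prompt>Would you like to hear the welcome again?</prompt>
    <filled>
      <if cond="again">
        <!-- Resetting both form item variables to undefined lets the
             form visit the block and then the field once more. -->
        <clear namelist="intro again"/>
      </if>
    </filled>
  </field>
</form>
```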

<prompt> Element

This element controls the output of synthesized speech and prerecorded audio. Prompts are queued for play, and interpretation continues until the user needs to provide an input. At that point, the queued prompts are played and the system waits for user input. Here is an example of a prompt:

<prompt>Please say your name.</prompt>

You can leave out the <prompt> ... </prompt> if:

  • There is no need to specify a prompt attribute (like bargein), and
  • The prompt consists entirely of PCDATA (contains no speech markups) or consists of just an <audio> or <value> element.

For instance, these are also prompts:

Please say your name.

<audio src="sayname.wav"/>

However, you must use the <prompt> tags when the prompt contains embedded speech markup, such as:

<prompt>Please <emphasis>say</emphasis> your city.</prompt>

The <prompt> element has the following attributes:

cond Expression that must evaluate to true after conversion to boolean in order for the prompt to be played. Default is true.
count A number that allows you to emit different prompts if the user is doing something repeatedly. If omitted, it defaults to "1".
timeout The timeout that will be used for the following user input. The default noinput timeout is platform specific.
xml:lang The language identifier for the prompt.
xml:base Declares the base URI from which relative URIs in the prompt are resolved.
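
The count attribute is commonly used for prompts that become more detailed on each attempt. The sketch below assumes a hypothetical city field and a placeholder grammar file cities.grxml.

```xml
<field name="city">
  <!-- The prompt with the highest count not exceeding the field's
       prompt counter is played on each visit. -->
  <prompt count="1">Which city?</prompt>
  <prompt count="2">Please say the name of the city you are calling about.</prompt>
  <prompt count="3" timeout="10s">
    Please say the name of a city, such as Atlanta, and then wait.
  </prompt>
  <!-- "cities.grxml" is a placeholder for an external SRGS grammar. -->
  <grammar src="cities.grxml" type="application/srgs+xml"/>
  <nomatch>Sorry, I did not understand. <reprompt/></nomatch>
</field>
```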


1. Create a VoiceXML document that offers the user three options to choose from using the telephone keypad: hotels, museums, or restaurants. Use a form for this exercise. Hint: this exercise needs the <option> element, for example: <option dtmf="1" value="varName"> Display name </option>

2. Create a VoiceXML document that offers the user the same three options (hotels, museums, or restaurants) to choose from using the telephone keypad. Use a menu dialog for this exercise.