WO2003091827A2 - A system and method for creating voice applications - Google Patents

A system and method for creating voice applications

Info

Publication number
WO2003091827A2
Authority
WO
WIPO (PCT)
Prior art keywords
client
server
format
voice
language
Application number
PCT/GB2002/001929
Other languages
French (fr)
Other versions
WO2003091827A3 (en)
Inventor
Emmanuel Rayner
Original Assignee
Fluency Voice Technology Limited
Application filed by Fluency Voice Technology Limited filed Critical Fluency Voice Technology Limited
Priority to PCT/GB2002/001929 priority Critical patent/WO2003091827A2/en
Priority to AU2002253334A priority patent/AU2002253334A1/en
Publication of WO2003091827A2 publication Critical patent/WO2003091827A2/en
Publication of WO2003091827A3 publication Critical patent/WO2003091827A3/en

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 - Arrangements for program control, e.g. control units
    • G06F9/06 - Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 - Multiprogramming arrangements
    • G06F9/54 - Interprogram communication
    • G06F9/547 - Remote procedure calls [RPC]; Web services
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 - Arrangements for software engineering
    • G06F8/20 - Software design

Definitions

  • the present invention relates to the development and deployment of applications for access via a voice browser or similar in a client-server environment.
  • WWW Worldwide Web
  • the WWW has moved beyond the mere provision of data, and is also widely used for performing transactions, such as on-line shopping, making travel reservations, and so on.
  • the WWW is based on pages presented in Hypertext Markup Language (HTML), which are accessed from a server over the Internet by a client using the hypertext transport protocol (HTTP).
  • HTTP hypertext transport protocol
  • the client, typically a conventional personal computer, normally runs browser software for this purpose, such as Microsoft Internet Explorer.
  • This type of client generally connects to the Internet through a modem/telephone link to an Internet Service Provider (ISP), or over a local area network (LAN) to an Internet gateway.
  • ISP Internet Service Provider
  • LAN local area network
  • Modern desktop computers generally support a wide range of multimedia capabilities, including graphics, sound, animation, and so on, which can be exploited by WWW content.
  • WML Wireless Markup Language
  • WAP Wireless Application Protocol
  • a WWW server 110 provides content that is accessible over the Internet 120 (or any other suitable form of data connection, such as intranet, LAN, etc). Also shown in Figure 1 is a client system 130, which interacts with the WWW server 110 over the Internet 120. Client system 130 is further connected to a conventional telephone 150 via the public switched telephone network (PSTN) 140 (although mobile/cellular telephone networks, etc could also be used).
  • PSTN public switched telephone network
  • the client system 130 acts as an intermediary in the overall architecture between the WWW server 110 and the user of telephone 150, and is sometimes referred to as a voice browser.
  • the user of telephone 150 typically dials the number corresponding to client system 130, which may be implemented in known fashion by an interactive voice response (IVR) system.
  • the client system 130 accesses WWW server 110 in order to retrieve information for handling the call, converts the information into audio using a text to speech (TTS) capability, and then transmits this audio over the telephone network to the user at telephone 150.
  • client system 130 can also receive audio input from the user of telephone 150 and convert this into a form suitable for transmission back to WWW server 110.
  • Such audio input is generally in the form of dual tone multiple frequency (DTMF) key presses and/or spoken input.
  • client system 130 includes a speech recognition (Reco) system.
  • a typical caller transaction using the system of Figure 1 is likely to involve an audio dialogue between the caller and the client system 130, and/or an HTTP-based dialogue between the client system 130 and the WWW server 110.
  • the client system 130 may itself know to prompt the caller for this information, and then send a complete request to the WWW server 110.
  • the client system 130 may send an initial request without date/time information to the WWW server 110, and will then be instructed by the WWW server 110 to obtain these details from the caller. Accordingly, the client system 130 will collect the requested information and forward it to the WWW server 110, whereupon the desired response can be provided.
  • in general, the WWW server 110 must be specifically adapted to handle voice browsing, since for an audio telephone connection all graphics and such-like will clearly be discarded.
  • the spoken interface also means that only a very limited amount of information can be presented to a caller in a reasonable time, compared to a normal HTML page.
  • the caller would normally be asked one question on the form at a time, rather than having the whole form presented to them in its entirety, as on a computer screen.
  • the WWW server 110 in the architecture of Figure 1 does not have to function as a conventional WWW server at all. Rather, in many implementations the server 110 is dedicated to voice applications, with user access only through an audio interface. Moreover, server 110 need not necessarily be connected to the Internet, but could be linked to the client system 130 via an intranet, extranet, or any other appropriate communications facility (which may simply be a point-to-point link, rather than a broader network).
  • VoiceXML is rapidly establishing itself as the de facto industry standard for interactive voice-enabled applications using an architecture such as shown in Figure 1.
  • VoiceXML is a scripting language based on (technically a schema of) the extensible mark-up language (XML), which in turn is a development of HTML.
  • XML itself is described in many books, for example "XML for the Worldwide Web” by Elizabeth Castro, Peachpit Press, 2001 (ISBN 0-201-71098-6), and "The XML Handbook” by Charles Goldfarb and Paul Prescod, Prentice Hall, 2000 (ISBN 0-13-014714-1).
  • VoiceXML provides a platform independent language for writing voice applications based on audio dialogs using TTS and digitised audio recordings for output, and speech and DTMF key recognition for input.
  • there are two main types of dialog: a form, which presents information and gathers input, and a menu, which offers choices of what to do next. (In practice most applications have been developed using only forms, since these can also be used to implement a menu structure.)
  • the VoiceXML code is downloaded from a server onto a client, which provides a VoiceXML browser to render the VoiceXML code to a caller.
  • probably the major attraction of VoiceXML is that it insulates the application writer from needing to know anything about the underlying telephony platform. VoiceXML also has the advantage of being specifically structured to support voice dialogs, although as a programming environment it does suffer from certain limitations and complexities, as will be described in more detail below. A copy of the formal VoiceXML specification plus other information about VoiceXML can be downloaded from www.voicexml.org.
  • Practical deployment systems are almost invariably implemented using dynamically generated VoiceXML, in which at least a portion of the relevant code is only created in response to a particular client request.
  • One typical reason for this is that the data required for a response (such as availability and pricing of a particular item) is almost always stored in a (separate) database system. If a user then requests information about such an item, the response is created on the fly by importing the current availability and pricing data from the database into the returned output, thereby ensuring that the user receives up-to-date information.
  • WWW servers typically use something like Perl-based CGI (common gateway interface) processes or Java servlets in order to handle dynamic page creation (Java is a trademark of Sun Microsystems Inc.). This is a much more challenging implementation framework than a static environment. Experience available to date suggests that in practice there are serious problems involved in building and maintaining dynamic VoiceXML applications in this context.
  • CGI common gateway interface
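  • as a minimal sketch of this kind of dynamic generation (the servlet, table and column names here are invented for illustration and do not come from the patent), a Java servlet might build its response from a database query at request time:

```java
// Hypothetical sketch of dynamic page creation in a servlet: current availability
// and pricing are read from a database and inserted into the returned mark-up.
// Table, column and parameter names are invented for the example.
import java.io.IOException;
import java.io.PrintWriter;
import java.sql.*;
import javax.servlet.http.*;

public class PriceServlet extends HttpServlet {
    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp) throws IOException {
        String item = req.getParameter("item");       // item the caller asked about
        resp.setContentType("text/xml");
        PrintWriter out = resp.getWriter();
        try (Connection con = DriverManager.getConnection("jdbc:hsqldb:mem:catalog");
             PreparedStatement ps = con.prepareStatement(
                     "SELECT price, in_stock FROM catalog WHERE name = ?")) {
            ps.setString(1, item);
            ResultSet rs = ps.executeQuery();
            if (rs.next()) {   // the response is created on the fly from current database values
                out.println("<item name=\"" + item + "\" price=\"" + rs.getDouble("price")
                        + "\" available=\"" + rs.getBoolean("in_stock") + "\"/>");
            } else {
                out.println("<item name=\"" + item + "\" found=\"false\"/>");
            }
        } catch (SQLException e) {
            resp.sendError(500, e.getMessage());
        }
    }
}
```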
  • in such an architecture, client-side code is VoiceXML, while server-side code is Java or Perl. It is therefore non-trivial to move functionality from one side to the other, since this normally requires the relevant code to be rewritten in a completely different language.
  • there are advantages and disadvantages in siting code on the server rather than the client (or vice versa), and the relative merits of these may change with time or circumstances, or simply be impossible to predict accurately in advance.
  • TellMe Network also supply the "Jumpstart Perl Server Package" (http://studio.tellme.com/downloads/VoiceXML-Server/Server.html), which makes it possible to write CGI-based VoiceXML applications in nearly pure Perl. This package will then generate VoiceXML code for execution on the client to perform simple speech input and output as required by the application. Note however that neither of these two packages provides any significant flexibility in terms of movement of code between the server and client in VoiceXML applications.
  • EP-A-1100013 describes a novel XML-based language, referred to as CML (conversational markup language) that can be used to generate multi-modal dialogues.
  • CML conversational markup language
  • a single dialogue can be developed which can then be transformed as appropriate into HTML, WML, VoiceXML, and so on.
  • CML yet another specialised XML language
  • this level of complexity and generality is not needed in many situations where a WWW server site is developed with a specific access mode in mind (e.g. by telephone), and the requirement is to optimise behaviour for this particular access mode.
  • WO 01/73755 describes a system for developing a specialised class of Web-based voice applications that use speech recognition.
  • a Web application with voice functionality can be written in standard script languages such as Jscript or Perlscript, with interface objects used by the script being provided as Active X objects.
  • the script and interface objects are then downloaded for execution on a client system having speech recognition capabilities.
  • the system bypasses VoiceXML altogether, but uses instead a "Teller" interface to process the application on the client. This approach is therefore limited to systems that support such an interface, in contrast to VoiceXML applications that are portable across a range of systems.
  • a method of developing a voice application for a client-server environment. The server supports a high-level procedural language in which data objects have a first format, and the client supports a voice mark-up language in which data objects have a second format.
  • the method begins with the steps of writing the voice application in a high-level procedural language, and providing one or more annotations for the voice application.
  • the annotations are indicative of which part of the voice application is to be executed on the server, and which part of the voice application is to be executed on the client.
  • the part of the voice application to be performed on the client is then transformed from the high-level procedural language into the voice mark-up language supported by the client in accordance with the annotations, and the part of the voice application that is to be executed on the server is modified to associate data objects in the first format on the server with data objects in the second format on the client.
  • This approach allows an application to be readily developed as normal code in a high-level procedural language such as Java, without worrying about the details of the client-server interactions. Rather, these are controlled by the set of annotations, which can then be updated later as desired, without having to modify the original application.
  • This separation makes application development much easier, in that an application having the correct logic flow can first be developed in a familiar environment to perform the processing relevant to both server and client. In a subsequent phase, the annotations are added to determine where each part of the application is to be executed. Functionality can thus be moved from client to server, or vice versa, with only minimal changes that do not affect the application's control logic.
  • a set of speech functions that can be invoked from the voice application in a high-level procedural language.
  • these functions can be provided as a set of methods within a utility class.
  • the speech functions are necessarily to be performed on the client, and so any invocations of these speech functions are automatically transformed into the voice mark-up language supported by the client. This simplifies matters for the developer, in that there is no need to include specific annotations for such speech functions. In fact, if no annotations are provided at all, then only these minimum speech functions will be performed on the client, with all the remaining processing being performed on the server.
  • the voice application can be divided into three portions: a first portion, which is to be executed on the client, and second and third portions, which are to be executed on the server.
  • the second portion comprises code that interacts directly with the first portion
  • the third portion comprises code that does not interact directly with the first portion.
  • the first portion is transformed from the high-level procedural language into the voice mark-up language supported by the client (typically VoiceXML).
  • the second portion is modified to associate data objects in a first format on the server with data objects in a second format on the client.
  • the third portion is generally not subject to modification (although may be slightly adapted, for example to conform to the particular web server configuration employed).
  • the annotations are used to explicitly identify functions that belong to the second portion of the voice application. If we regard the application functions as being arranged in an invocation hierarchy, then those above the annotated functions belong to the third portion (since the application must commence on the server), while those below the annotated functions belong to the first portion. More specifically, the latter can be identified automatically by determining the transitive closure of functions called by the annotated functions. It will be appreciated that by only having to identify the subset of functions that actually transfer control from the server to the client, the annotation task is considerably simplified.
  • the modification of methods in the second portion in this embodiment involves the replacement of the function by a corresponding proxy that is used to store information specifying the associations between data objects on the server and data objects on the client. (This may also involve a suitable adaptation of the invoking code in the third portion to call the proxy, rather than the original function). These associations are important to ensure that the server and client code behave as a coherent unit.
  • the proxy translates the arguments of its corresponding function into code in the voice mark-up language, and then transfers the code to the client for execution.
  • This facility represents one mechanism for handling dynamic data objects (i.e. those that are only specified at run-time), and so greatly extends the range of applications that can be developed using the above approach.
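  • by way of illustration, a generated proxy method might take roughly the following shape on the server; all class and method names in this sketch are assumptions, since the patent does not reproduce the generated code:

```java
// Hypothetical, simplified shape of a compiler-generated proxy method.
// All class and method names here are illustrative assumptions, not taken from the patent.
import java.util.HashMap;
import java.util.Map;

public class AskForDateProxy {

    /** table associating client-side (VoiceXML/ECMAScript) names with server-side objects */
    private static final Map<String, Object> correspondence = new HashMap<String, Object>();

    /** stand-in for the HTTP gateway: renders a page and blocks until the client submits */
    public interface Gateway {
        Map<String, String> renderAndAwaitSubmit(String voiceXmlPage);
    }

    /** VoiceXML for this call tree, as produced in advance by the static compiler */
    private static final String PRECOMPILED_ASK_FOR_DATE_VXML =
            "<form id=\"ask_for_date\"> ... </form>";

    public static String askForDate(Gateway gateway, Object serverSideCalendar) {
        // (a) translate the Java argument into client-side form, recording the association
        correspondence.put("calendar_1", serverSideCalendar);
        String dynamicVxml = "<var name=\"calendar_1\" expr=\"'...'\"/>"; // generated at run-time
        // (b) combine the run-time VoiceXML with the pre-compiled VoiceXML
        String page = dynamicVxml + PRECOMPILED_ASK_FOR_DATE_VXML;
        // (c) render the page to the client and (d) wait for the <submit> coming back
        Map<String, String> submitted = gateway.renderAndAwaitSubmit(page);
        // (e) decode the returned values, updating server-side objects or computing a return value
        return submitted.get("date");
    }
}
```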
  • a method of developing a voice application for a client-server environment. Typically the server supports a high-level procedural language in which data objects have a first format, while the client supports a voice mark-up language in which data objects have a second format.
  • the method begins with writing the voice application in a high-level procedural language. A part of the voice application is to be executed on the server, and a part of the voice application is to be executed on the client.
  • the voice application is then compiled, and the part of the voice application that is to be executed on the client platform is transformed into the voice mark-up language, while the part of the voice application that is to be executed on the server is modified in order to associate data objects in the first format on the server with data objects in the second format on the client.
  • This approach allows all the application code to be written in a single high level procedural language.
  • this language is Java (or more specifically a subset of Java), but other languages such as C, C++ and so on could be used instead.
  • This has the advantage of generally being a much more familiar programming environment than a standard voice mark-up language, so it is easier for users to attract and retain developers and support staff with the requisite experience.
  • the voice mark-up language is VoiceXML.
  • conditional constructions and loop constructions in the high-level procedural language can be compiled into VoiceXML conditional subdialog calls and recursive VoiceXML subroutine calls, respectively.
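  • as a rough illustration of the kind of rewriting this implies (the method names are invented for the example), a SpeechJava loop can be re-expressed as a recursive static method, which then maps naturally onto a recursive VoiceXML subroutine call:

```java
// Illustrative only: how a loop can be recast as recursion so that it maps onto
// a recursive VoiceXML <subdialog> call. Method names are invented for the example.
public class LoopExample {

    // Original SpeechJava-style loop: keep asking until a confirmation is obtained.
    static boolean confirmLoop() {
        boolean confirmed = false;
        while (!confirmed) {
            confirmed = askOnce();
        }
        return confirmed;
    }

    // Equivalent recursive formulation: each iteration becomes a further call,
    // which the compiler can realise as a VoiceXML subroutine invoking itself.
    static boolean confirmRecursive() {
        if (askOnce()) {
            return true;
        }
        return confirmRecursive();
    }

    // Stand-in for a SpeechIO-based yes/no question.
    static boolean askOnce() {
        return Math.random() > 0.5;
    }
}
```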
  • the VoiceXML specification includes support for ECMAScript-compatible code.
  • functions that are to be executed on the client platform in the voice mark-up language and that do not directly call basic speech functions on the client are compiled into such ECMAScript-compatible code.
  • the motivation for this is that the performance of ECMAScript code on most VoiceXML platforms tends to be better than general VoiceXML.
  • the functions that directly call basic speech functions are retained in VoiceXML, since ECMAScript does not support this functionality.
  • a method of running a voice application for a client-server environment in which a server supports a high-level procedural language in which data objects have a first format, and a client supports a voice mark-up language in which data objects have a second format.
  • the method starts with commencing the voice application on the server, and the voice application then runs on the server until processing is to be transferred to the client.
  • code in the voice mark-up language is dynamically generated on the server from the voice application in a high-level procedural language. This dynamically generated code supports the transformation of server-side data objects from the first format into the second format.
  • the dynamically generated code in the voice mark-up language is then rendered from the server to the client for execution on the client.
  • the voice application is normally commenced in response to a request from the client, which may be received over any appropriate communications facility.
  • client request itself is typically generated in response to an incoming telephone call to the client.
  • the voice application comprises three portions: a first portion that is to be executed on the client, and which is transformed from the high-level procedural language into the voice mark-up language supported by the client; a second portion that is to be executed on the server to interact directly with the first portion, and which is modified to associate data objects in the first format on the server with data objects in the second format on the client; and a third portion that is to be executed on the server, but that does not interact directly with the first portion.
  • the second portion is responsible for dynamically generating code on the server in the voice mark-up language, the dynamically generated code supporting the transformation of the at least one data object from said first format into said second format.
  • This dynamically generated code is then combined with said first portion for rendering from the server to the client for execution on the client.
  • the first portion of the code may itself be dynamically generated at run-time, but in one preferred embodiment is previously generated through a compilation process. This improves performance by avoiding the need to have to generate the first portion of code in the voice mark-up language each time the voice application is run.
  • the second portion maintains a table indicating the association between data objects on the server in the first format and data objects on the client in the second format.
  • the updated version of the object received from the client can be matched to the original version on the server, and this original version then updated accordingly.
  • Another embodiment of the invention provides apparatus for developing a voice application for a client-server environment, in which a server supports a high-level procedural language in which data objects have a first format, and a client supports a voice mark-up language in which data objects have a second format.
  • the voice application is written in a high-level procedural language, and accompanied by one or more annotations indicative of which part of the voice application is to be executed on the server, and which part of the voice application is to be executed on the client.
  • the apparatus comprises: means for transforming the part of the voice application to be performed on the client from the high-level procedural language into the voice mark-up language supported by the client in accordance with the annotations; and means for modifying the part of the voice application that is to be executed on the server in order to associate data objects in the first format on the server with data objects in the second format on the client.
  • Another embodiment of the invention provides apparatus for developing a voice application for a client-server environment, in which a server supports a high-level procedural language in which data objects have a first format, and a client supports a voice mark-up language in which data objects have a second format.
  • the voice application is written in a high-level procedural language.
  • the apparatus comprises a compiler for performing a compilation process on the voice application and includes: means for transforming the part of the voice application that is to be executed on the client platform into the voice mark-up language as part of the compilation process; and means for modifying the part of the voice application that is to be executed on the server in order to associate data objects in the first format on the server with data objects in the second format on the client as part of the compilation process.
  • Another embodiment of the invention provides a server for running a voice application in a client-server environment, in which the server supports a high-level procedural language in which data objects have a first format and a client supports a voice mark-up language in which data objects have a second format.
  • the server includes an application server system for launching the voice application (which includes at least one data object in the first format).
  • the voice application then runs on the server until processing is to be transferred to the client.
  • the server further includes a dynamic compiler for generating code on the server in said voice mark-up language from the voice application, wherein the dynamically generated code supports the transformation of the data object from the first format into the second format, and a communications facility for rendering the dynamically generated code in the voice mark-up language from the server to the client for execution on the client.
  • Another embodiment of the invention provides a computer program for use in developing a voice application for a client-server environment, in which a server supports a high-level procedural language in which data objects have a first format, and a client supports a voice mark-up language in which data objects have a second format.
  • the voice application is written in a high-level procedural language and accompanied by one or more annotations indicative of which part of the voice application is to be executed on the server, and which part of the voice application is to be executed on the client.
  • the program comprises instructions to perform the steps of: transforming the part of the voice application to be performed on the client from the high-level procedural language into the voice mark-up language supported by the client in accordance with the annotations; and modifying the part of the voice application that is to be executed on the server in order to associate data objects in the first format on the server with data objects in the second format on the client.
  • Another embodiment of the invention provides a compiler for developing a voice application for a client-server environment, in which a server supports a high-level procedural language in which data objects have a first format, and a client supports a voice mark-up language in which data objects have a second format.
  • the voice application is written in a high-level procedural language
  • the compiler includes program instructions for performing a compilation process on the voice application.
  • the compilation process includes: transforming the part of the voice application that is to be executed on the client platform into the voice mark-up language; and modifying the part of the voice application that is to be executed on the server in order to associate data objects in the first format on the server with data objects in the second format on the client.
  • Another embodiment of the invention provides a computer program providing a platform for running a voice application in a client-server environment, in which a server supports a high-level procedural language in which data objects have a first format, and a client supports a voice mark-up language in which data objects have a second format.
  • the program includes instructions for commencing the voice application (which involves at least one data object in the first format) on the server, and running the voice application on the server until processing is to be transferred to the client.
  • the program instructions support the dynamic generation of code on the server in the voice mark-up language, wherein the dynamically generated code supports the transformation of the data object from the first format into the second format.
  • the program instructions then render the dynamically generated code in the voice mark-up language from the server to the client for execution on the client.
  • the above computer programs are executed on one or more machines.
  • execution includes interpreting, such as for some Java code, or rendering, such as for mark-up languages.
  • the computer programs may be preinstalled on disk storage of the relevant machines, or supplied as a computer program product.
  • such a program product, which typically comprises the program instructions stored in/on a medium, may be downloaded over a network (such as the Internet), or supplied as a physical storage device, such as a CD ROM.
  • the program instructions are usually first copied into main memory (RAM) of the machine, and then executed by the processor(s) of the machine.
  • program instructions are normally copied and saved onto disk storage for the machine, and then are executed from this disk storage version, although they may also be executed directly from the CD ROM, etc. It will be appreciated that the apparatus and computer program/computer program product embodiments of the invention will generally benefit from the same preferred features as described above with reference to method embodiments of the invention.
  • the above approach greatly simplifies the process of developing and deploying interactive spoken dialogue applications implemented in a voice mark-up language, such as dynamic VoiceXML, for use in a client-server environment.
  • a compiler, a run-time environment, and other associated software modules are provided, which in the preferred embodiment enable specification of the application in a subset of Java equipped with some speech utility classes.
  • the application program in Java can then be compiled into either server-side or client-side code as desired. More particularly, the compilation process results in a mixture of Java and VoiceXML, with the Java running on the server side and the VoiceXML on the client side.
  • the distribution of code between the client side and the server side is controlled by a set of annotations, as set out in a file or other suitable facility.
  • the voice application can also be compiled to run in a Java-only environment (typically on a single system).
  • Figure 1 is a schematic illustration of the general use of a voice browser
  • Figure 2 is a schematic diagram illustrating the main components in the voice application development system of the present invention
  • Figure 3 is a simplified schematic diagram illustrating the main components involved in running a voice application in a single processor environment
  • Figure 4 is a flowchart illustrating the compilation of a voice application into dynamic VoiceXML
  • Figure 5 is a simplified schematic diagram illustrating the main components involved in running a voice application in a dynamic client-server VoiceXML environment
  • Figure 6 is a flowchart illustrating the steps performed in running a voice application in a dynamic client-server VoiceXML environment
  • Figures 7A-7K illustrate the communications between the components of Figure 5 in performing the steps of Figure 6; and Figure 8 is a flowchart providing an overview of the voice application development process.
  • Figure 2 depicts the main components of the voice application development environment as disclosed herein. It will be appreciated that the underlying motivation of this environment is to generally allow a voice application to be efficiently developed for use in a configuration such as shown in Figure 1.
  • the main components illustrated in Figure 2 are: (1) The SpeechJava language definition 10, including a utility class SpeechIO, which carries out basic speech input and output operations;
  • a single processor runtime environment 20 including a suitable implementation of the SpeechIO class, which enables execution of SpeechJava applications as normal Java programs;
  • a "dynamic compiler" 40 which converts a SpeechJava program, together with a small set of annotations, into a dynamic VoiceXML program comprising a standard Java program and a collection of static VoiceXML pages.
  • the annotations specify which parts of the application are to be run on the server, and which parts on the client; and
  • a dynamic VoiceXML runtime environment 50 implemented on top of the standard Tomcat gateway or a similar piece of software, which enables execution of code generated by the dynamic compiler.
  • (Tomcat is a freeware server technology available from http://jakarta.apache.org/.)
  • the dynamic VoiceXML environment 50 depends on the dynamic SpeechJava to VoiceXML compiler 40, which in turn depends on the static SpeechJava to VoiceXML compiler 30.
  • the single-processor environment depends only on the SpeechJava language definition 10.
  • SpeechJava represents a subset of the Java programming language, and in one embodiment contains at least the following constructs:
  • Switch statements. 10. Definitions of inner classes. These classes can contain definitions of data members, but not necessarily anything else. 11. At least the following types of expressions: a. Arithmetic expressions; b. Relational expressions; c. Array element expressions; d. 'new' expressions.
  • SpeechJava language definition incorporates a newly defined utility class (“SpeechIO”) that provides low-level speech functions for application input and output.
  • the SpeechIO class contains methods for at least the following operations: a. speech recognition using a specified grammar; b. speech output using a recorded wavfile; c. speech output using a text-to-speech engine.
  • a further output utility class (“RecStructure”) is also provided that represents the result of performing a speech recognition operation (this can be incorporated into the SpeechIO class if desired).
  • SpeechJava programs can be developed and run in any normal Java environment equipped with an implementation of the SpeechIO class.
  • One easy way to do this is to implement a server which can carry out the basic speech input and output operations, and then realise the SpeechIO class as a client to this server (note that the client and server do not necessarily have to be on the same system, for example if remote method invocation is used).
  • the RecStructure class can be implemented as an extension of Hashtable or some similar class, and may be provided as an inner class for example of the SpeechIO class.
  • this is illustrated in Figure 3, where a SpeechJava application 301 runs in a Java environment 305.
  • the SpeechJava application 301 calls methods in the SpeechIO class 302 in order to perform voice input/output operations.
  • (Figure 3 does not show the RecStructure class separately; rather, this is treated as a component of the SpeechIO class 302.)
  • the SpeechIO class 302 can be regarded in effect as a wrapper for functionality provided by the underlying voice platform 303. This platform is typically outside the Java environment 305, but can be accessed by suitable native language calls from the SpeechIO class 302. It will be appreciated of course that Figure 3 represents a standard architecture for writing conventional (non Web-based) voice applications in Java.
  • This compiler converts SpeechJava programs into equivalent static VoiceXML programs.
  • the basic idea is to realise definitions of static SpeechJava methods as VoiceXML <form> elements, invocations of static SpeechJava methods as VoiceXML <subdialog> elements, and instances of SpeechJava inner classes as Javascript (ECMAScript) objects. Most of this process can be carried out using standard techniques, as described in more detail below.
  • the first limitation can be overcome by noting that although VoiceXML syntax does not permit insertion of a <subdialog> element into an <if> element, it is nevertheless possible to give a
  • conditional will occur in a context where local variables are defined, and these variables may be referenced in the body of the conditional. It is thus necessary to pass the local variables into the new subdialogs sub_if and sub_else, return their new values on exiting the subdialogs, and use the returned values to update the local variables in the translated conditional. b. It may also happen that one or both of the branches of the conditional contains an occurrence of 'return'.
  • compilation of SpeechJava into static VoiceXML is normally only possible for code that can be executed entirely on the client (this happens to be the case for the example application given in section 3 below). More generally, when at least some portions of the original Java code are to be run on the server, either for reasons of efficiency or because some server processing is necessary in the application, then a dynamic compiler must be used, as will now be described.
  • Compiler 40 transforms annotated SpeechJava programs into Java-based dynamic VoiceXML programs.
  • the resulting code from the compilation process comprises a Java program to be executed on the server, together with a set of one or more pieces of VoiceXML to be executed on the client.
  • Annotations are utilised to control which Java methods are to be executed on the server (hence remaining as Java), and which are to be executed on the client (hence being compiled in VoiceXML).
  • a developer can use the annotations to identify those methods that are desired to transfer processing from the server to the client.
  • the compiler knows that the SpeechIO methods must be implemented on the client (since only this has speech facilities), and so will automatically convert these into the appropriate VoiceXML, with the necessary transfer of control from the server.
  • the annotations therefore allow a developer to specify additional processing to be performed on the client. If no annotations are provided, then only the basic minimum of speech operations will be performed on the client.
  • internalise the source code and the annotations (step 410). This transforms the code into representations of flow, method calls, and so on that are easier to work with. Note that internalisation per se is well-known in the art. Indeed, in the preferred embodiment, the internalisation is performed using a publicly available piece of freeware, namely the ANTLR parser-generator, together with the accompanying grammar for Java (see www.antlr.org).
  • use the call graph and the annotations (step 430) to separate the method declarations into three groups: a. Normal server side methods. These will stay as Java. b. Client side methods. These will become VoiceXML. c. Server side methods that transfer control to the client side. These will be replaced by special proxy methods (see the sketch following these compilation steps).
  • the separation into the three groups is performed based on the knowledge that the program starts on the server side, and remains there until a transfer is encountered. Such a transfer can either be explicitly identified in the annotations file, or else implicit (thus a call to a method in the SpeechIO class must cause a transfer to the client, since only the client supports the necessary audio input/output facilities). Note that each transfer is effective for the duration of a single called client method, after which processing is returned back to the server via a "submit" operation. 4. For each method in group (c) above, use the call graph to compute the transitive closure of the method under the invocation relation (step 440).
  • each call to the client comprises a single method
  • this called method may in turn invoke further methods that are also performed on the client (in the same manner as a conventional function or program stack).
  • once these further methods have completed, we return to the originally called client method, and then back to the server.
  • This set of methods comprises the transitive closure of the called method.
  • Each method in the set is therefore translated into VoiceXML using the SpeechJava to static VoiceXML compiler, as previously described.
  • a corresponding "proxy" method is created (step 450) that performs the following actions: a. Call a utility method to translate the arguments of the group (c) method into a piece of VoiceXML code, and store a table associating server side objects with client side objects. b. Combine this piece of VoiceXML code with the VoiceXML code compiled in (4). c. Render out the combined VoiceXML to the client. d. Wait for a new submit from the client containing the returned information. e. Decode the returned information, and if necessary update server-side objects and/or compute a return value.
  • step 460 For each method in group (a) above modify as follows (step 460): a. Replace calls to methods in group (c) above with calls to the corresponding proxy methods. b. Replace calls to SpeechIO primitives with calls to the corresponding methods that render out client-side code.
  • server side units are converted back to external Java syntax (step 470).
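  • the grouping and transitive-closure computation of steps 430-440 can be sketched as follows; the call-graph representation used here is an assumption made for illustration, since the actual compiler works on the internalised ANTLR representation:

```java
// Illustrative sketch of the grouping logic of steps 430-440. The call graph is
// represented here as a simple adjacency map; this data structure is an assumption.
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class MethodGrouping {

    /**
     * Computes the transitive closure, under the invocation relation, of the methods
     * called by the transfer methods; these form the client-side (VoiceXML) group (b).
     *
     * @param callGraph   caller name -> set of callee names
     * @param transferSet methods annotated (or implied via SpeechIO) as transferring to the client
     */
    public static Set<String> clientSideMethods(Map<String, Set<String>> callGraph,
                                                Set<String> transferSet) {
        Set<String> closure = new HashSet<String>();
        Deque<String> work = new ArrayDeque<String>(transferSet);
        while (!work.isEmpty()) {
            String method = work.pop();
            for (String callee : callGraph.getOrDefault(method, Collections.<String>emptySet())) {
                if (closure.add(callee)) {
                    work.push(callee);   // the callee also runs on the client
                }
            }
        }
        // transferSet itself is group (c), to be replaced by proxy methods;
        // everything not in transferSet or the closure stays as server-side Java (group (a)).
        return closure;
    }
}
```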
  • Figure 5 illustrates the components of the client/server environment for execution of the compiled SpeechJava voice application programs in dynamic VoiceXML mode. Note that this environment includes both client-side and server-side processing.
  • Figure 5 illustrates a server platform 110 and a client platform 130 (mirroring the arrangement of Figure 1).
  • the client and server are connected by a link 501 that supports HTTP communications.
  • Link 501 may therefore be implemented over the Internet if so desired. It will be appreciated that any other appropriate communications facility or protocol could be utilised instead, provided they were appropriately supported by the various components.
  • the client and server platforms each comprise standard desktop computers running the Windows NT operating system (Windows 2000), available from Microsoft Corporation.
  • the server side software 502 includes the Tomcat server implementation previously mentioned, as available from http://jakarta.apache.org, which creates new Tomcat servlets in response to incoming requests from clients.
  • the server includes the following main components:
  • the server side Java code 510 produced by the dynamic VoiceXML compiler. Note that this is running on a Java virtual machine 505 (i.e. a Java run-time environment).
  • the (pre-compiled) VoiceXML code 520 as produced by the dynamic VoiceXML compiler.
  • An HTTP gateway process 540. This incorporates a Gateway Server 541, a Gateway Client
  • a utility class 530 that is responsible for translating at run-time between server-side (Java) and client-side (VoiceXML) representations. For this purpose, utility class 530 maintains a correspondence table 535, which stores associations between objects on the client and on the server. Note that this utility class 530 may if desired be split into multiple classes, and in some embodiments may be incorporated into the gateway process 540.
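  • a minimal sketch of the book-keeping that the translation utility class 530 and its correspondence table 535 might perform is given below; the method names and the use of a plain map are assumptions made for illustration:

```java
// Illustrative sketch of the translation utility (530) and its correspondence
// table (535). Method names and representation choices are assumptions.
import java.util.HashMap;
import java.util.Map;

public class TranslationUtility {

    /** correspondence table: client-side (VoiceXML/ECMAScript) variable name -> server-side object */
    private final Map<String, Object> table = new HashMap<String, Object>();

    /** translate a server-side object into a client-side declaration, recording the association */
    public String toClient(String clientName, Object serverObject) {
        table.put(clientName, serverObject);
        // In practice the object's fields would be serialised into an ECMAScript object literal.
        return "<var name=\"" + clientName + "\" expr=\"'" + serverObject + "'\"/>";
    }

    /** decode a value submitted back from the client, locating the original server-side object */
    public Object fromClient(String clientName, String submittedValue) {
        Object original = table.get(clientName);
        // The caller would update the original object (or build a return value) from submittedValue.
        return original;
    }
}
```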
  • the client system 130 includes a standard VoiceXML browser 550, which communicates with the server-side gateway process via HTTP.
  • the VoiceXML browser 550 is used to render the VoiceXML code 551, which is downloaded from the server.
  • browser 550 is implemented by the V-Builder program, available from Nuance Corporation (www.nuance.com).
  • client system 130 is then provided with a SoundBlaster audio card to provide audio input/output.
  • the production version of this embodiment (which unlike the development environment is telephony enabled and supports multiple sessions) utilises the Nuance Voice Webserver program to provide the VoiceXML browser 550, which renders the VoiceXML code 551.
  • Client side system 506 then incorporates suitable telephony interface software and hardware (not shown in Figure 5), as supported by the Nuance Voice Web Server product. (See http://www.nuance.com/products/voicexml.html for more details of these various Nuance products).
  • the client begins by sending a 'run' request 710 to the gateway process 540 on the server over link 501 (step 605, Figure 7A).
  • This request specifies the name of a particular desired application, and will normally be generated in response to an incoming call to client system 130. Typically this will be a conventional telephone call over a land or mobile network, although some clients may support Voice over Internet Protocol (VoIP) calls, or some other form of audio communication.
  • client system 130 may send a run request 710 as part of outbound call processing (as well as or instead of inbound call processing).
  • the particular application request sent to the server 110 may be dependent on one or more parameters such as the called number, the calling number, the time of day, and so on.
  • the gateway process 540 communicates with the server side software 502 in order to start executing the named application program as a new thread 510 (step 610, Figure 7B). It will be appreciated that the processing so far conforms (per se) to existing VoiceXML applications.
  • a proxy method is one that results in a call to be performed on the client. (Note that if there are no such proxy methods encountered, the processing effectively goes straight to step 670, described below).
  • the proxy method in the Java application 510 calls a utility method from the translation utility class 530 to translate the arguments 720 of the proxy method into a piece of VoiceXML code 730, which is then returned back to the proxy method (step 620, Figure 7D).
  • the utility method also stores a table 535 that contains the associations between server side objects in the Java application 510 and the corresponding client side objects in the returned VoiceXML code 730.
  • the proxy method combines the newly generated VoiceXML code 730 received back from the utility class with the appropriate portion 735 of the pre-compiled VoiceXML code 520.
  • This latter component represents a translation of the set of client methods called from this particular proxy method (step 625, Figure 7E). It will be understood therefore that much of the VoiceXML code for execution on the client can be determined (statically) in advance from the original Java code. However, some of the client VoiceXML code can only be generated dynamically at run-time, based for example on the particular request from the client or on particular information in a database.
  • the VoiceXML code 740 comprising the combination produced in the preceding step of the statically prepared VoiceXML code 735 with the dynamically created VoiceXML code 730, is passed to the HTTP gateway process 540, which then renders it out over communications link 501 to the client 130 (step 630, Figure 7F).
  • VoiceXML code 740 as downloaded to the client in Figure 7F corresponds to VoiceXML code 551 as illustrated in Figure 5).
  • This VoiceXML browser 550 receives the VoiceXML code 740 from the server 110, and starts to execute this received VoiceXML code 740 (step 640, Figure 7G). Meanwhile, the server-side process thread in the Java application 510 suspends (one possible mechanism for this suspend/resume hand-off is sketched after these run-time steps).
  • the VoiceXML browser 550 client executes the received VoiceXML code 740 until it returns to the top level call of this code.
  • the VoiceXML code forming this top level call concludes with a <submit> element. This triggers a return to the server-side gateway process 540, passing back a return value or values 750 if appropriate (step 645, Figure 7H). This allows, for example, spoken information recognised during the call to be submitted back to the Java application 510 for processing on the server side.
  • the HTTP gateway process 540 on the server side 110 wakes up the relevant thread in Java application 510 (this thread having been suspended as part of step 640).
  • the gateway process 540 then passes the reawakened server thread the return information received from the client as part of the submit process of the preceding step (step 650, Figure 7I). Note that at this stage the returned information is still in the form of VoiceXML objects 760 (i.e. as received from the VoiceXML code on the client 130).
  • the thread in the Java application 510 that received the VoiceXML objects 760 from the gateway process calls a method (or methods) in the translation utility class 530 to translate these objects back into Java (step 655, Figure 7J).
  • the VoiceXML objects submitted back from the client are passed to the translation utility class 530, which uses the table created in step 620 above to decode the contents of these objects.
  • This content 770 can then be returned to the Java application 510 by updating existing server-side objects, creating new server-side objects, or creating a call return value (or some combination of these three actions).
  • the proxy method is now able to continue processing, and to eventually complete and return. This leads to the resumption of normal server-side processing of Java application 510 (step 660, Figure 7K).
  • the continued processing of the relevant thread of Java application 510 on the server 110 may lead to one or more further proxy methods being called, if there are more portions of code to be run on client 130. If this turns out to be the case (step 665), then processing returns to repeat stages (3) through (11), as just described. 13. Finally all the proxy methods in the relevant Java application 510 have been completed, and so the server-side program is ready to terminate. At this point, there is an outstanding HTTP request from the client (given the request/response model of client-server HTTP communications). Thus the gateway process 540 renders out a piece of null VoiceXML to the client 130 in order to formally satisfy this remaining request (step 670). This then allows this thread of the server Java application 510 to conclude. Likewise, the client 130 may conclude the call, or may perform additional processing associated with the call that does not involve server 110.
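  • one simple way to realise the suspend/resume behaviour of steps 640-660 is a shared hand-off object using Java's wait/notify, as sketched below; this synchronisation mechanism is an assumption, since the patent states only that the application thread suspends and is later woken by the gateway:

```java
// Illustrative hand-off between a proxy method (application thread) and the HTTP
// gateway (servlet thread). The use of wait/notify here is an assumption about the
// mechanism; the patent only states that the application thread suspends and is woken.
import java.util.Map;

public class SubmitHandoff {

    private Map<String, String> submittedValues;   // VoiceXML objects returned by the client
    private String pageToRender;                   // VoiceXML page awaiting delivery to the client

    /** called by the proxy method: publish the page, then block until the client submits */
    public synchronized Map<String, String> renderAndAwaitSubmit(String voiceXmlPage)
            throws InterruptedException {
        pageToRender = voiceXmlPage;
        notifyAll();                               // let the gateway deliver the page
        while (submittedValues == null) {
            wait();                                // application thread suspends (step 640)
        }
        Map<String, String> result = submittedValues;
        submittedValues = null;
        return result;                             // resume server-side processing (step 660)
    }

    /** called by the gateway when the client's <submit> arrives (steps 645-650) */
    public synchronized void submitArrived(Map<String, String> values) {
        submittedValues = values;
        notifyAll();                               // wake the suspended application thread
    }

    /** called by the gateway to fetch the page it should render to the client */
    public synchronized String takePageToRender() throws InterruptedException {
        while (pageToRender == null) {
            wait();
        }
        String page = pageToRender;
        pageToRender = null;
        return page;
    }
}
```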
  • Figure 8 provides an overview of the voice application development process in this embodiment. This commences with writing the application in the SpeechJava format (step 810). Once the application has been written, it can be tested on a single processor Java platform provided with suitable audio facilities (step 820). This testing is useful to verify correct application behaviour, although it can be omitted if desired.
  • the next step in Figure 8 is to compile the application into static VoiceXML (step 830). Note that this is only feasible if the entire application can potentially be run on the client. Thus once the complete application has been statically compiled into VoiceXML, it is possible to test the VoiceXML code on the client (step 840). This can be useful for understanding application behaviour and performance. Note however that the static compilation and associated testing can be omitted if desired; indeed, they must be omitted if the application includes operations that can only be performed on the server, since in this case the static compilation will not be viable.
  • the annotations are then developed to control whether processing is to be performed on the server or on the client (step 850).
  • the annotations can be considered as optional, in that if no annotations are provided, the system defaults to performing only the basic speech input/output operations on the client.
  • the application can now be compiled into dynamic VoiceXML (step 860), using the annotations to control the location of the relevant processing, and finally installed onto the server system, ready for use (step 870).
  • step 860 could potentially be postponed until run-time (i.e. after the installation of step 870). This would allow the annotation file to be specified at run-time (perhaps for example dependent on the type of client that initiated the request to the server). However, in the preferred embodiment the compilation is done in advance. This avoids the repetition of having to perform substantially the same compilation process for each user request. Nevertheless, it is not possible to create all the VoiceXML code in advance; rather the generation of a certain proportion of the VoiceXML code, connected with data object transfer between the server and client, must be deferred until run-time (see step 620 of Figure 6).
  • a single server can then deploy these different versions, and determine which one to utilise in response to a given client request.
  • Section 2.1 presents one preferred definition of the SpeechJava language
  • Section 2.2 presents a single-processor environment for running SpeechJava programs
  • Section 2.3 describes in detail one particular SpeechJava to static VoiceXML compiler
  • Section 2.4 describes in detail one particular SpeechJava to dynamic VoiceXML compiler
  • Section 2.5 describes in detail one particular environment for the execution of SpeechJava programs that have been compiled into dynamic VoiceXML.
  • SpeechJava is a subset of Java equipped with an extra utility class called SpeechIO. Static methods are used in effect as though they were C-style functions, and inner classes as though they were C-style structs. Input and output are handled through the SpeechIO class.
  • SpeechJava contains the following constructs: 1. Definitions of top-level classes, where <className> is the class identifier and <Body> is a list of definitions of static methods and/or inner classes.
  • a utility class RecStructure, extending Hashtable, which is intended to represent the results of calling recognition.
  • a RecStructure object associates int or String values with a set of String slots. It contains the following methods: a) String getSlotValueAsString (String slotName) Returns the value of a String-valued slot, or null if it has no value.
  • a utility class for low-level speech functions called SpeechIO, containing the following methods: a) RecStructure Recognise(String grammar)
  • SpeechJava programs conforming to the above description can be developed and run in any normal Java environment equipped with an implementation of the SpeechIO and RecStructure classes.
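  • as an illustration, a small SpeechJava-style application might look as follows; the dialogue, grammar name and slot name are invented for the example, and the sketch assumes implementations of the SpeechIO and RecStructure classes along the lines described in this section:

```java
// A small illustrative SpeechJava-style application: static methods only, with all
// input/output going through SpeechIO. The grammar name and slot names are invented.
public class PizzaOrder {

    public static void main(String[] args) {
        SpeechIO.sayTTS("Welcome to the pizza line.");
        String topping = askForTopping();
        SpeechIO.sayTTS("Thank you. One " + topping + " pizza is on its way.");
    }

    static String askForTopping() {
        SpeechIO.sayTTS("What topping would you like?");
        RecStructure result = SpeechIO.recognise("topping_grammar");
        String topping = result.getSlotValueAsString("topping");
        if (topping == null) {
            SpeechIO.sayTTS("Sorry, I did not catch that.");
            return askForTopping();      // retry via recursion rather than a loop
        }
        return topping;
    }
}
```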
  • One easy way to implement the RecStructure class (which can potentially be incorporated into the SpeechIO class) is as an extension of Hashtable.
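  • a minimal sketch of such a RecStructure implementation is shown below; the int accessor and the setter are assumptions based on the description of 'int or String values' above:

```java
// Minimal sketch of RecStructure as an extension of Hashtable, per the description above.
// The getSlotValueAsInt and setSlotValue methods are assumptions made for illustration.
import java.util.Hashtable;

public class RecStructure extends Hashtable<String, Object> {

    /** returns the value of a String-valued slot, or null if it has no value */
    public String getSlotValueAsString(String slotName) {
        Object value = get(slotName);
        return (value instanceof String) ? (String) value : null;
    }

    /** returns the value of an int-valued slot, or -1 if it has no value (assumed signature) */
    public int getSlotValueAsInt(String slotName) {
        Object value = get(slotName);
        return (value instanceof Integer) ? (Integer) value : -1;
    }

    /** records a recognised slot/value pair */
    public void setSlotValue(String slotName, Object value) {
        put(slotName, value);
    }
}
```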
  • the SpeechIO class can be realised by first implementing a server (corresponding to the voice platform 303 of Figure 3) that can carry out the basic speech input and output operations.
  • This server can be built on top of any standard speech recognition platform, for example a toolkit from Nuance Corporation (see above).
  • This server provides a minimal speech application, whose top-level loop reads messages from an input channel and acts on them in an appropriate way. Messages will be of the following three types:
  • the server sends a return message over an output channel containing either the recognition result, or a notification that recognition failed for some reason.
  • the SpeechIO class itself (302 in Figure 3) can be implemented as a client to the server described above. Each of the three SpeechIO methods then functions by sending an appropriate message to the server, and in the case of the Recognise message also waiting for a return value.
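  • a minimal sketch of such a SpeechIO client is given below; the socket transport, port number and line-oriented message format are assumptions, since the patent specifies only the three message types and that Recognise waits for a return value, and the sketch relies on a RecStructure class of the kind outlined above:

```java
// Illustrative SpeechIO implemented as a client to the speech server of Figure 3.
// The socket transport, port and line-based protocol are assumptions for the sketch.
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.Socket;

public class SpeechIO {

    private static final String HOST = "localhost";
    private static final int PORT = 9999;     // assumed port of the speech server

    /** speech recognition using the named grammar; waits for the server's return message */
    public static RecStructure recognise(String grammar) {
        String reply = sendAndMaybeWait("RECOGNISE " + grammar, true);
        RecStructure result = new RecStructure();
        if (reply != null && !reply.startsWith("FAILED")) {
            // assumed reply format: slot=value pairs separated by spaces
            for (String pair : reply.trim().split("\\s+")) {
                String[] kv = pair.split("=", 2);
                if (kv.length == 2) {
                    result.setSlotValue(kv[0], kv[1]);
                }
            }
        }
        return result;
    }

    /** speech output from a recorded wavfile */
    public static void sayWavfile(String wavfilePath) {
        sendAndMaybeWait("SAY_WAVFILE " + wavfilePath, false);
    }

    /** speech output via the text-to-speech engine */
    public static void sayTTS(String text) {
        sendAndMaybeWait("SAY_TTS " + text, false);
    }

    private static String sendAndMaybeWait(String message, boolean waitForReply) {
        try (Socket socket = new Socket(HOST, PORT);
             PrintWriter out = new PrintWriter(socket.getOutputStream(), true);
             BufferedReader in = new BufferedReader(new InputStreamReader(socket.getInputStream()))) {
            out.println(message);
            return waitForReply ? in.readLine() : null;
        } catch (IOException e) {
            throw new RuntimeException("speech server unavailable: " + e.getMessage(), e);
        }
    }
}
```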
  • This section describes one embodiment of a compiler that converts annotated SpeechJava programs conforming to the framework of Section 2.1 above into equivalent VoiceXML programs.
  • the annotations for indicating whether code is to be performed on the server or on the client are provided as a separate input file.
  • the compiler performs this conversion as a sequence of three main steps:
  • the first stage is performed in one embodiment using known methods.
  • the parsing of the Java code to internal form is done using a version of the freeware ANTLR parser-generator, together with the accompanying freeware ANTLR grammar for Java (www.antlr.org).
  • this has a simple structure and can be parsed using simple ad hoc methods.
  • the third stage above renders abstract VoiceXML into executable VoiceXML.
  • the compiled VoiceXML is initially in abstract or internalised form for easier manipulation (similar to the internalised form of Java produced at step 410 in Figure 4). It will be appreciated that techniques for converting back from this abstract form into a final executable form are well-known in the art, and accordingly will not be described in further detail herein.
  • VoiceXML, in contrast to Java, does not permit expressions to contain subdialog calls.
  • Java expressions in general translate into two components: a VoiceXML expression, and a list of zero or more statements which are executed before the statement in which the target expression appears. If this list is non-empty, this is generally because it contains a <subdialog> item.
  • Translate_expression consequently takes the following arguments:
  • Each recognition grammar used as an argument to an invocation of SpeechIO . recognise is the subject of a declaration using a line of the form grammar grammar ⁇ slot_l, slot_2, ..., slotji ⁇ where grammar is the name of the grammar as it appears in the invocation of SpeechIO . recognise, and slot_l, ... slot ⁇ are the names of the slots filled by grammar.
  • Each grammar declaration is translated into a <form> item corresponding to a recognition call involving the defined grammar.
  • the declaration above translates into a <form> with the following appearance:
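  • As an illustration only (the grammar file reference, prompt handling and slot name are assumptions rather than actual compiler output), a declaration such as grammar NUMBER {number} could give rise to a <form> along the following lines:

      <form id="recognition_subdialog_for_NUMBER">
        <field name="number">
          <grammar src="NUMBER.grxml"/>
        </field>
        <filled>
          <return namelist="number"/>
        </filled>
      </form>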
  • definitions of inner classes are internalised and stored for use in translating other constituents, but are not directly translated into output VoiceXML.
  • a static method definition is translated into one or more <form> elements, as follows:
  • the body of the method is translated into abstract VoiceXML, potentially including one or more ancillary <form> elements resulting from translation of conditional or loop elements.
  • the list of current local variables is initialised to the list of formal parameters.
  • SpeechIO methods recognise, sayWavfile and sayTTS are translated as follows. recognise: a method invocation of the form:
  • SpeechIO.recognise(grammar) is translated into a VoiceXML fragment of the form:
  • <subdialog src="#recognition_subdialog_for_grammar" name="subdialog_1"/> where subdialog_1 is a new subdialog identifier.
  • Declarations are translated into <var> items. If initial values are supplied, these are either translated into 'expr' attributes, or into assignments on the newly defined variables.
  • the following examples illustrate how the translation is carried out: int i;
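  • For example (the variable name is arbitrary), a declaration without and with an initial value would translate along the following lines:

      int i;          becomes     <var name="i"/>
      int i = 3;      becomes     <var name="i" expr="3"/>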
  • method invocations are translated into <subdialog> items, but the form of the translation depends on whether the method invocation occurs in a 'statement' or an 'expression' context. If the method invocation appears as a statement, then it is directly translated as a <subdialog> item. If the method invocation is instead part of an expression, it is translated in the output VoiceXML expression as the expression subdialog_1.return_value, where subdialog_1 is a newly generated identifier, and the <subdialog> item is added to the output list of 'extra statements' produced by the relevant call to translate_expression. In both cases, the list of actual parameters to the method invocation is translated using translate_expression, and the resulting output list of 'extra statements' is added to the current list of 'extra statements'.
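  • To illustrate (the form and identifier names are assumed), a statement-context invocation foo(x); translates directly as:

      <subdialog name="subdialog_1" src="#foo">
        <param name="x" expr="x"/>
      </subdialog>

    whereas for an expression-context invocation such as y = foo(x) + 1; the same <subdialog> item is placed on the list of 'extra statements' and the enclosing statement becomes:

      <assign name="y" expr="subdialog_1.return_value + 1"/>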
  • the basic strategy is to define two new subdialogs (call them cond_sub_1 and cond_sub_2) that respectively encode the 'if' and 'else' branches of the conditional.
  • the compiler then recodes the conditional in terms of conditional subdialog calls by introducing a new variable whose value is set depending on the result of the conditional's test. This is followed by calls to cond_sub_1 and cond_sub_2 conditional on appropriate values of the branch variable.
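  • Schematically (all identifiers here are illustrative), a conditional of the form if (x > 3) {...} else {...} is recoded along the following lines:

      <var name="branch_var_1"/>
      <block>
        <if cond="x &gt; 3">
          <assign name="branch_var_1" expr="1"/>
        <else/>
          <assign name="branch_var_1" expr="2"/>
        </if>
      </block>
      <subdialog name="cond_call_1" src="#cond_sub_1" cond="branch_var_1 == 1"/>
      <subdialog name="cond_call_2" src="#cond_sub_2" cond="branch_var_1 == 2"/>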
  • Assignment statements are translated as appropriate either into VoiceXML <assign> elements, or into ECMAScript (JavaScript) assign statements wrapped in <script> elements.
  • the <assign> element is used if the left-hand side of the assignment is a simple variable, and the <script> element otherwise.
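  • For example (names arbitrary), a simple variable assignment and an array element assignment would be rendered respectively as:

      x = y + 1;      becomes     <assign name="x" expr="y + 1"/>
      a[i] = y;       becomes     <script>a[i] = y;</script>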
  • Java numerical literals are translated as VoiceXML numerical literals.
  • Java string literals are translated as VoiceXML string literals.
  • the special Java constant 'null' is translated as the VoiceXML string "undefined".
  • Java data member expressions of the form: class_instance.data_member_name are translated as ECMAScript object property references of the form: class_instance.data_member_name
  • Java array element expressions of the form: array_instance[index] are translated as ECMAScript array element expressions of the form: array_instance[index]
  • Java arithmetic operators '+', '-', '*', '/' and '%' are translated into the ECMAScript arithmetic operators of the same names.
  • Java string concatenation operator '+' is translated into the ECMAScript operator of the same name.
  • tmp_var_1 is the new temporary variable
  • the value associated with the structure_type key encodes the information that tmp_var_1 is an array of two objects of type String
  • the value associated with the identity_type key is a unique new tag.
  • a SpeechJava method cannot be translated into ECMAScript if it includes calls to SpeechIO primitives, since speech operations can only be carried out within a VoiceXML form, and JavaScript functions have no mechanism for calling VoiceXML forms.
  • the compiler carries out a static analysis of the input SpeechJava program to determine those method definitions that will be performed on the client side and that can be compiled into ECMAScript.
  • the compiler first constructs a call graph, in which the client side methods are nodes and method invocations are arcs. Methods are then labelled as being either 'VoiceXML' or 'ECMAScript', as follows:
  • 4) Step (3) is repeated until a fixed point is reached. 5) All remaining methods are labelled 'ECMAScript'.
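  • The following is a schematic sketch of this labelling pass; the call-graph representation (a map from each client-side method to the methods it invokes) is an illustrative choice, not the data structure used by the actual compiler:

      import java.util.*;

      final class ClientSideLabelling {
          // Returns the client-side methods that must be labelled 'VoiceXML';
          // every remaining client-side method can be labelled 'ECMAScript'.
          static Set<String> voiceXmlMethods(Map<String, List<String>> callGraph,
                                             Set<String> callsSpeechIOPrimitive) {
              // Methods that invoke SpeechIO primitives directly must stay in VoiceXML.
              Set<String> voiceXml = new HashSet<>(callsSpeechIOPrimitive);
              // Propagate the label to callers until a fixed point is reached, since an
              // ECMAScript function has no mechanism for calling a VoiceXML form.
              boolean changed = true;
              while (changed) {
                  changed = false;
                  for (Map.Entry<String, List<String>> caller : callGraph.entrySet()) {
                      for (String callee : caller.getValue()) {
                          if (voiceXml.contains(callee) && voiceXml.add(caller.getKey())) {
                              changed = true;
                          }
                      }
                  }
              }
              return voiceXml;
          }
      }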
  • Methods labelled 'VoiceXML' are translated as described earlier in this section, and the remaining methods can be translated into ECMAScript.
  • This translation of Java into ECMAScript is reasonably straightforward (compared to the translation into VoiceXML per se), and in broad terms involves the translation of Java static methods into ECMAScript function definitions. The skilled person will then be able to map Java control primitives and operators into ECMAScript control primitives and operators without undue difficulty.
  • Java data structures can be mapped into the same ECMAScript data structures as in the VoiceXML case.
  • a table is utilised to keep track of whether each method is realised as VoiceXML code or ECMAScript code.
  • This section describes in detail one embodiment of a compiler for implementing the method sketched in Section 1.4 above in order to transform annotated SpeechJava programs into Java-based dynamic VoiceXML programs.
  • the annotations specify which Java methods are to be executed on the server (hence remaining as Java), and which are to be executed on the client (hence being compiled into VoiceXML).
  • the convention used is that the annotations explicitly specify a set of zero or more methods that are to transfer processing to the client. The transitive closure of this set is then run on the client.
  • each annotation line has the form execute_on_client class.method/arity, where class is the name of a class
  • method is the name of a method in that class with arity arguments.
  • a typical line might be: execute_on_client get_a_number1.hello/0
  • the code produced by the compiler comprises a Java program to be executed on the server, together with a set of one or more pieces of VoiceXML to be executed on the client.
  • the top-level steps carried out by the compiler have already been described in Section 1.4 above (see also Figure 4). We now discuss each of these steps in more detail for one particular embodiment.
  • the Java code is internalised using the ANTLR parser for Java referred to in Section 2.3 above, and the annotation file is internalised using straightforward ad hoc methods. Methods listed in the annotations file are marked in a table as execute_on_client methods; these are the methods that are to transfer processing to the client.
  • the call graph is constructed by recursively traversing the internalised Java code and noting the following: a) Instances of invocations of method M_1 inside method definition M_2, for some M_1, M_2. In this case, an arc of the form calls(M_2, M_1) is added to the call graph; b) Instances of invocations of the form SpeechIO.recognise(G_1) inside method definition M_2, for some named grammar G_1. In this case, an arc of the form uses_grammar(M_2, G_1) is added to the call graph.
  • define server_side_call(M_1, M_2) so that server_side_call(M_1, M_2) holds iff calls(M_1, M_2) holds and M_2 is not an execute_on_client method, and let server_side_call* be the reflexive and transitive closure of server_side_call. Then we divide up the set of methods as follows:
  • the server-side methods consist of the set of all methods M such that server_side_call*(main, M) holds; these methods remain as Java.
  • the client-side methods consist of the remaining methods. These methods are to be translated into VoiceXML.
  • this transitive closure is translated into VoiceXML using the SpeechJava to static VoiceXML compiler, using the appropriate declaration for each of the grammars G found in step 2.
  • the result is written out to a file; call this file client_side_code(M).
  • For each execute_on_client method M, a corresponding "proxy" method M' is created.
  • let the signature of the original method be: return_type M(type_0 f_arg_0, type_1 f_arg_1, ... type_n f_arg_n) where the return type of the method is return_type, the names of the arguments are f_arg_0 ... f_arg_n, and their types are type_0 ... type_n.
  • let the name of the file that contains the client-side code for M be client_side_code(M).
  • where arg_i is a primitive int, arg_i' is the expression new Integer(arg_i); otherwise, arg_i' is arg_i.
  • arg_i' should always be a proper object, as opposed to a primitive type.
  • Object[] args = {arg_0', arg_1', ... arg_n'}; gatewayServer.convertArgsToVXML("return_type", "M", f_args, args); return gatewayServer.sendVXML("return_type", "client_side_code(M)");
  • gatewayServer.convertArgsToVXML takes as input the arguments of M, packaged as the Object array args, the names of the formal arguments, packaged as the String array f_args, the name of the return type, and the name of the method M itself. It uses this information to generate a piece of VoiceXML, which calls the code in client_side_code(M). In general, this involves creating client-side objects that correspond to the server-side objects in the arguments of M, so it also stores a table that associates each client-side object with its corresponding server-side object.
  • gatewayServer.sendVXML combines the VoiceXML code produced in step 1 with the precompiled client-side code in client_side_code(M), and renders this out to the client. It then waits for the client to perform a new submit which will contain the returned information. When this is received, it decodes it and if necessary updates server-side objects and/or computes a return value.
  • a call to SpeechIO.sayTTS would for example be replaced by a call to the proxy method execute_client_sayTTS, defined as follows:
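  • By way of illustration only (the formal parameter name and the generated file name are assumptions, and gatewayServer is the private GatewayServer instance described below), such a proxy might take the form:

      public void execute_client_sayTTS(String arg_0) {
          String[] f_args = {"text"};
          Object[] args = {arg_0};
          gatewayServer.convertArgsToVXML("void", "sayTTS", f_args, args);
          gatewayServer.sendVXML("void", "sayTTS_1.vxml");
      }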
  • the server-side code is subjected to a final transformation, which for each top-level class class does the following: 1. class is made to extend the class 'GatewayRunnable'.
  • class is provided with a public method called 'run', as follows: public void run() { main_proxy(); gatewayServer.end(); }
  • server-side methods are then rendered out in standard Java syntax using a simple recursive descent algorithm.
  • the implementation includes a GatewayRunnable class and the Gateway.
  • Server-side programs are classes that implement the GatewayRunnable interface. Communication between the voice-browser client and the server program first goes to a Tomcat Servlet (see http://jakarta.apache.org/, as previously mentioned); communication between the Tomcat Servlet 543 and the server-side program 510 is through the Gateway process 540 (see Figure 5).
  • the top-level modules in the Gateway are the following:
  • The GatewayClient class.
  • The FileClassLoader class. The GatewayServer class.
  • The SharedMemory class. Execution starts when the servlet accepts a URL request to run a new program, specified as a string program.
  • the servlet passes this request to the GatewayClient 542, which invokes the FileClassLoader to locate and run a class named program. Since this class extends GatewayRunnable, which in turn extends Runnable, program can be started as a separate thread.
  • the instance of program communicates with the voice-browser client through its private instance of GatewayServer 542; this communicates with the GatewayClient 542, which in turn communicates with the Tomcat Servlet 543.
  • the GatewayServer and GatewayClient pass information through an instance of the SharedMemory class.
  • the proximate interface between the instance of program and the Gateway consists of the two GatewayServer methods, convertArgsToVXML and sendVXML:
  • convertArgsToVXML takes the run-time information pertaining to the method invocation, and creates a small piece of VoiceXML that acts as a 'header' for the main piece of VoiceXML that has been produced and saved at compile-time.
  • the server-side Java objects are translated into ECMAScript counterparts, and the correspondence between them is saved in a server-side table.
  • sendVXML combines the header information produced by convertArgsToVXML with the pre-compiled VoiceXML, and renders it out to the client. It then waits for the next client request, which should contain updated versions of the objects that have been sent from the server, possibly together with a return value.
  • the server-side correspondence table is used to translate the data back into server-side form, update server-side objects where appropriate, and if necessary produce a server-side return value.
  • Section 3 presents an illustrative example of a SpeechJava program and its translation into static and dynamic VoiceXML.
  • the program itself is presented in Section 3.1, and the static VoiceXML translation in Section 3.2.
  • Section 3.3 presents a dynamic VoiceXML translation.
  • Section 3.4 describes the run-time processing carried out by the dynamic VoiceXML program.
  • the example program is a Java class called find_largest_number, which contains four static methods.
  • the program uses text to speech (TTS) conversion to prompt the user for three numbers, finds the largest one, and speaks the result. The way in which this is done has been chosen to display many of the features of the SpeechJava language and its compilation into static and dynamic VoiceXML.
  • SpeechIO.sayTTS("Component " + i + " I heard " + numbers[i]); }
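  • Purely for orientation (the method signature, slot name and prompt handling shown here are assumptions rather than the actual listing), a method such as get_number might take the following general form:

      static int get_number(String prompt) {
          SpeechIO.sayTTS(prompt);
          RecStructure result = SpeechIO.recognise("NUMBER");
          return Integer.parseInt(result.getSlotValueAsString("number"));
      }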
  • the first branch involves only items whose translation can be included in the scope of an <if> form, so their translations can be left in place.
  • the output consists of a Java file, representing the code to be run on the server, and a VoiceXML file, representing the pre-compiled portion of the code to be run on the client.
  • the Java file is as follows, with comments as before in italics:
  • String[] f_args = {"numbers", "prompts", "size"};
  • Object[] args = {arg_0, arg_1, new Integer(arg_2)}; gatewayServer.convertArgsToVXML("void", "get_number_array", f_args, args); gatewayServer.sendVXML("void", "get_number_array_3.vxml"); }
  • the VoiceXML file get_number_array_3.vxml comprises the relevant subset of the file produced by the static VoiceXML compiler, presented in Section 3.2 above. Specifically, it contains the definitions of the five forms 'get_number_array', 'while_sub_1', 'get_number', 'cond_sub_1' and 'recognition_subdialog_for_NUMBER'.
  • the dynamic VoiceXML program of Section 3.3 is executed in the runtime environment of Section 2.5.
  • the initial steps (all of which can be considered as routine) are omitted.
  • arg_0 an int array with three uninitialised elements.
  • arg_1 a String array with three elements, whose values are "Say a number", "Say another number" and
  • arg_2 an int with value 2.
  • f_args a String array whose three elements have the values "numbers", "prompts" and "size".
  • args an Object array whose three elements have the values arg_0, arg_1 and new Integer(2).
  • the invocation is: gatewayServer.convertArgsToVXML("void", "get_number_array", f_args, args)
  • the purpose of this call is to produce a "header" piece of VoiceXML that makes a <subdialog> invocation of a form.
  • This invocation returns a void value (the first argument); the VoiceXML <form> invoked is called "get_number_array" (the second argument); the names of the formal parameters for the call are in f_args (the third argument); and the run-time values of these parameters are in args (the fourth argument).
  • the following piece of VoiceXML is produced, with comments as before in italics:
  • the variable 'references' is used to hold client-side translations of all the server-side objects that are passed to the call.
  • the code in convertArgsToVXML also constructs an association list which associates the Java objects passed to the call (here arg_0 and arg_l) with the corresponding indices in 'references'.
  • the list associates arg_0 (the int array) with references[2], and arg_1 (the String array) with references[1].
  • the following call to sendVXML combines the dynamically generated VoiceXML code immediately above with the pre-compiled VoiceXML fragment get_number_array_3.vxml described at the end of Section 3.3. It then sends the result through the Gateway to the client, and waits for the next Gateway request, which should be the result of the <submit> at the end of the dynamically generated VoiceXML fragment.
  • when the Gateway receives this new request, it passes the body of the <submit> back to the sendVXML method.
  • suppose that the user responded "seven", "five" and "ten".
  • the final step is to use the association list to unpack this information back to the original server-side data-structures. Since the list associates arg_0 (the int array) with element 2 of 'references', the numbers 7, 5 and 10 are correctly entered into elements 0, 1 and 2 respectively of the int array arg_0, thereby completing the call.
  • Section 4.1 discusses using a larger subset of Java.
  • Section 4.2 discusses using procedural languages other than Java.
  • Section 4.3 discusses using voice scripting languages other than VoiceXML.
  • Section 4.1.3 considers the question of whether there are any Java constructs for which there may not be a suitable translation into VoiceXML.
  • the basic strategy for translating non-static methods into VoiceXML is first to reduce them to functions, with one function for each method name. Since the architecture tags each generated VoiceXML object with the type of the Java object to which it is intended to correspond, it is then possible to use run-time dispatching to delegate a method invocation to the translation of the individual method appropriate to the object on which the method was invoked, together with its arguments.
  • MClass1 is created from the definitions of the method M in Class1 by adding the Class1 object o on which the method is invoked as an extra argument.
  • Direct references to data members of Class1 in the definition of the method M are then translated into corresponding references to the data members of o. For example, suppose Class1 has a String data member called 'message', and M is defined as follows:
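  • For purposes of illustration (the body of M is an assumption, not the original example), M might be defined as:

      class Class1 {
          String message;
          void M() {
              SpeechIO.sayTTS(message);   // direct reference to the data member 'message'
          }
      }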
  • MClass1 is then defined as:
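  • Continuing the same illustration, the derived function takes the Class1 object o as an extra argument:

      static void MClass1(Class1 o) {
          SpeechIO.sayTTS(o.message);     // the direct reference becomes a reference via o
      }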
  • a partial treatment of exceptions can be implemented in a straightforward manner by translating them into a special type of return value. This would be similar to the treatment of local variables in conditionals and iterative loops described in Sections 2.3.8 and 2.3.10. This type of solution is expected to work adequately for user-generated exceptions, i.e. exceptions intentionally created using the 'throw' construction; 'throw' will just translate into a special kind of 'return' statement.
  • a generalised scheme for handling exceptions not generated by user code, such as an exception resulting from a division by zero, is somewhat more problematic, although partial solutions that are likely to be sufficient in most practical situations are again feasible.
  • SALT (Speech Application Language Tags) is one example of a voice scripting language other than VoiceXML.
  • a voice browser supporting such an alternative language could be employed.
  • the user could connect to the client via a computer network rather than a conventional telephone network (using a facility such as Voice over the Internet).
  • voice browsers may also be provided on other forms of client system, some potentially quite different from a standard interactive voice response system.
  • in one such scenario, the client is a Personal Digital Assistant (PDA).
  • the PDA still functions as a VoiceXML client, and is therefore directly compatible with the approach described above.
  • this wide range in the nature of the potential client device underlines the usefulness of annotations, in that the optimum distribution of processing between the server and client will clearly vary according to the properties of the client for any given application environment.

Abstract

A method is provided for developing and running voice applications in a client-server environment. The server platform supports a high-level procedural language, such as Java, whereas the client platform supports a voice mark-up language, such as VoiceXML. The method comprises the steps of first writing the voice application in a high-level procedural language, and then providing one or more annotations for the voice application. The annotations serve to indicate which part of the voice application is to be executed on the server, and which part of the voice application is to be executed on the client. The latter part of the voice application (i.e. the portion to be performed on the client) is then transformed from the high-level procedural language into the voice mark-up language supported by the client, in accordance with the annotations. Most of this transformation can be completed statically in advance, but the remainder of the transformation is performed dynamically at run-time.

Description

A SYSTEM AND METHOD FOR CREATING VOICE APPLICATIONS
Field of the Invention
The present invention relates to the development and deployment of applications for access via a voice browser or similar in a client-server environment.
Background of the Invention
The last decade has seen rapid growth in the Worldwide Web (WWW) on the Internet, from its origins as a research tool for particle physicists, through to its current position as a major global information source for a very wide range of contexts, including commercial, academic, governmental, and so on. Indeed, the WWW has moved beyond the mere provision of data, and is also widely used for performing transactions, such as on-line shopping, making travel reservations, and so on.
The WWW is based on pages presented in Hypertext Markup Language (HTML), which are accessed from a server over the Internet by a client using the hypertext transport protocol (HTTP). The client, typically a conventional personal computer, normally runs browser software for this purpose, such as Microsoft Internet Explorer. This type of client generally connects to the Internet though a modem/telephone link to an Internet Service Provider (ISP), or over a local area network (LAN) to an Internet gateway. Modern desktop computers generally support a wide range of multimedia capabilities, including graphics, sound, animation, and so on, which can be exploited by WWW content.
However, there are many situations where a person may want to use the WWW or the Internet, but does not have access to a suitable computer. For example, many households simply do not have computers, which are relatively expensive, complex and unreliable (compared to other consumer devices), as well as tending to depreciate rapidly. Alternatively, a person may be travelling away from their normal home or business location.
Various attempts have been made to accommodate such situations, primarily via the telephone network. One approach is to provide data services over cellular networks, using technologies such as Wireless Markup Language (WML) and the associated Wireless Application Protocol (WAP) (respectively mirroring HTML and HTTP for the mobile environment). However, given that mobile phones are generally designed to be as small as possible, it can be difficult to provide a satisfactory graphical or textual user interface on such devices. In addition, services like WAP/WML can only be utilised over networks and from phones which specifically provide support for this format; this normally prevents access from the vast majority of currently installed landline telephones. A somewhat different approach is illustrated in Figure 1, where a WWW server 110 provides content that is accessible over the Internet 120 (or any other suitable form of data connection, such as intranet, LAN, etc). Also shown in Figure 1 is a client system 130, which interacts with the WWW server 110 over the Internet 120. Client system 130 is further connected to a conventional telephone 150 via the public switched telephone network (PSTN) 140 (although mobile/cellular telephone networks, etc could also be used).
The client system 130 acts as an intermediary in the overall architecture between the WWW server 110 and the user of telephone 150, and is sometimes referred to as a voice browser. In operation, the user of telephone 150 typically dials the number corresponding to client system 130, which may be implemented in known fashion by an interactive voice response (IVR) system. The client system 130 accesses WWW server 110 in order to retrieve information for handling the call, converts the information into audio using a text to speech (TTS) capability, and then transmits this audio over the telephone network to the user at telephone 150. Conversely, client system 130 can also receive audio input from the user of telephone 150 and convert this into a form suitable for transmission back to WWW server 110. Such audio input is generally in the form of dual tone multiple frequency (DTMF) key presses and/or spoken input. To support this latter option, client system 130 includes a speech recognition (Reco) system.
A typical caller transaction using the system of Figure 1 is likely to involve an audio dialogue between the caller and the client system 130, and/or a HTTP based dialogue between the client system 130 and the WWW server 110. For example, if the caller is ringing up to check performance details at a theatre, then the caller will have to specify the relevant date and time. In some circumstances the client system 130 may itself know to prompt the caller for this information, and then send a complete request to the WWW server 110. In other circumstances, the client system 130 may send an initial request without date/time information to the WWW server 110, and will then be instructed by the WWW server 110 to obtain these details from the caller. Accordingly, the client system 130 will collect the requested information and forward it to the WWW server 110, whereupon the desired response can be provided.
It will be appreciated therefore that there is some flexibility in how to distribute application intelligence between the WWW server 110 and the client system 130. The former option can be regarded as cleaner from an architectural perspective, since application knowledge is focussed at a single site (the WWW server), thereby facilitating management and control. On the other hand, the latter option will tend to improve performance, since it eliminates certain communications between the client system 130 and the WWW server 110, and so will reduce waiting time for the caller.
Note that the content of general WWW server 110 must be specifically adapted to handle voice browsing, since clearly for an audio telephone connection all graphics and such-like will be discarded. The spoken interface also means that only a very limited amount of information can be presented to a caller in a reasonable time, compared to a normal HTML page. In addition, when asking the caller to complete a form for an on-line purchase (for example), the caller would normally be asked one question on the form at a time, rather than having the whole form presented to them in its entirety, as on a computer screen.
Of course, it will be appreciated that the WWW server 110 in the architecture of Figure 1 does not have to function as a conventional WWW server at all. Rather, in many implementations the server 110 is dedicated to voice applications, with user access only through an audio interface. Moreover, server 110 need not necessarily be connected to the Internet, but could be linked to the client system 130 via an intranet, extranet, or any other appropriate communications facility (which may simply be a point-to-point link, rather than a broader network).
VoiceXML is rapidly establishing itself as the de facto industry standard for interactive voice- enabled applications using an architecture such as shown in Figure 1. VoiceXML is a scripting language based on (technically a schema of) the extensible mark-up language (XML), which in turn is a development of the HTML. XML itself is described in many books, for example "XML for the Worldwide Web" by Elizabeth Castro, Peachpit Press, 2001 (ISBN 0-201-71098-6), and "The XML Handbook" by Charles Goldfarb and Paul Prescod, Prentice Hall, 2000 (ISBN 0-13-014714-1).
VoiceXML provides a platform independent language for writing voice applications based on audio dialogs using TTS and digitised audio recordings for output, and speech and DTMF key recognition for input. There are two main types of dialog, a form, which presents information and gathers input, and a menu, which offers choices of what to do next. (In practice most applications have been developed using only forms, since these can also be used to implement a menu structure). The VoiceXML code is downloaded from a server onto a client providing a VoiceXML browser to render the VoiceXML code to a caller.
Probably the major attraction of VoiceXML is that it insulates the application writer from needing to know anything about the underlying telephony platform. VoiceXML also has the advantage of being specifically structured to support voice dialogs, although as a programming environment it does suffer from certain limitations and complexities, as will be described in more detail below. A copy of the formal VoiceXML specification plus other information about VoiceXML can be downloaded from www.voicexml.org.
It is relatively easy to use VoiceXML to develop simple applications in the form of sets of static VoiceXML pages. These are pages that are created and stored in advance on WWW server 110. Practical deployment systems, however, are almost invariably implemented using dynamically generated VoiceXML, in which at least a portion of the relevant code is only created in response to a particular client request. One typical reason for this is that the data required for a response (such as availability and pricing of a particular item) is almost always stored in a (separate) database system. If a user then requests information about such an item, the response is created on the fly by importing the current availability and pricing data from the database into the returned output, thereby ensuring that the user receives up-to-date information.
WWW servers typically use something like Perl-based CGI (common gateway interface) processes or Java servlets in order to handle dynamic page creation (Java is a trademark of Sun Microsystems Inc.). This is a much more challenging implementation framework than a static environment. Experience available to date suggests that in practice there are serious problems involved in building and maintaining dynamic VoiceXML applications in this context.
Two issues stand out in particular. Firstly, it is in general hard to develop and debug an application in which the code to be executed on the client is dynamically generated by a server process (it is much easier to work with an architecture in which static code is executed on a single process). Secondly, the client-side code and server-side code are quite different. Typically, client-side code is VoiceXML, while server-side code is Java or Perl. It is therefore non-trivial to move functionality from one side to the other, since this normally requires the relevant code to be rewritten in a completely different language. However, as mentioned above, there are advantages and disadvantages in siting code on the server rather than client (or vice versa), and the relative merits of these may change with time or circumstances, or simply be impossible to predict accurately in advance. Thus the process of developing and tuning an application frequently requires code to be moved in this way. For example, a small change in functionality may mean that an operation previously executed on the client has to be moved to the server, since it now requires access to server-side data. Conversely, performance issues may require server-side code to be moved to the client, so as to minimise latency due to HTTP delay between the client and the server.
The literature already describes a number of tools that allow VoiceXML applications to be specified more or less abstractly. For example, TellMe Networks (www.tellme.com) supply the "VoiceXML Perl Module" (http://studio.tellme.com/downloads/code-vxmlpm.html). This allows programmers to write applications as Perl modules (Perl is a well-known scripting language), which are then rendered into VoiceXML. Each VoiceXML element is associated with a corresponding Perl subroutine, implying that an understanding of the general structure of VoiceXML is required in order to utilise the package effectively. TellMe Networks also supply the "Jumpstart Perl Server Package" (http://studio.tellme.com/downloads/VoiceXML-Server/Server.html), which makes it possible to write CGI-based VoiceXML applications in nearly pure Perl. This package will then generate VoiceXML code for execution on the client to perform simple speech input and output as required by the application. Note however that neither of these two packages provides any significant flexibility in terms of movement of code between the server and client in VoiceXML applications.
EP-A-1100013 describes a novel XML-based language, referred to as CML (conversational markup language) that can be used to generate multi-modal dialogues. Thus a single dialogue can be developed which can then be transformed as appropriate into HTML, WML, VoiceXML, and so on. However, this requires the developer to understand yet another specialised XML language (CML), and appears to be limited to a certain class of applications. Furthermore, this level of complexity and generality is not needed in many situations where a WWW server site is developed with a specific access mode in mind (e.g. by telephone), and the requirement is to optimise behaviour for this particular access mode.
WO 01/73755 describes a system for developing a specialised class of Web-based voice applications that use speech recognition. A Web application with voice functionality can be written in standard script languages such as Jscript or Perlscript, with interface objects used by the script being provided as ActiveX objects. The script and interface objects are then downloaded for execution on a client system having speech recognition capabilities. The system bypasses VoiceXML altogether, but uses instead a "Teller" interface to process the application on the client. This approach is therefore limited to systems that support such an interface, in contrast to VoiceXML applications that are portable across a range of systems.
Summary of the Invention
Accordingly, the invention provides methods, apparatus, and computer programs as defined in the appended claims.
In one embodiment, there is provided a method of developing a voice application for a client-server environment. The server supports a high-level procedural language in which data objects have a first format, and the client supports a voice mark-up language in which data objects have a second format. The method begins with the steps of writing the voice application in a high-level procedural language, and providing one or more annotations for the voice application. The annotations are indicative of which part of the voice application is to be executed on the server, and which part of the voice application is to be executed on the client. The part of the voice application to be performed on the client is then transformed from the high-level procedural language into the voice mark-up language supported by the client in accordance with the annotations, and the part of the voice application that is to be executed on the server is modified to associate data objects in the first format on the server with data objects in the second format on the client.
This approach allows an application to be readily developed as normal code in a high-level procedural language such as Java, without worrying about the details of the client-server interactions. Rather, these are controlled by the set of annotations, which can then be updated later as desired, without having to modify the original application. This separation makes application development much easier, in that an application having the correct logic flow can first be developed in a familiar environment to perform the processing relevant to both server and client. In a subsequent phase, the annotations are added to determine where each part of the application is to be executed. Functionality can thus be moved from client to server, or vice versa, with only minimal changes that do not affect the application's control logic. This therefore allows the application to be optimised for a particular client- server environment by suitable tuning of the annotations, without any risk of accidentally corrupting the correct behaviour of the application itself. Note that the annotations are typically set out in a conventional text file, but any other appropriate mechanism can be used.
In a preferred embodiment, a set of speech functions is provided that can be invoked from the voice application in the high-level procedural language. In a Java environment, these functions can be provided as a set of methods within a utility class. The speech functions are necessarily to be performed on the client, and so any invocations of these speech functions are automatically transformed into the voice mark-up language supported by the client. This simplifies matters for the developer, in that there is no need to include specific annotations for such speech functions. In fact, if no annotations are provided at all, then only these minimum speech functions will be performed on the client, with all the remaining processing being performed on the server.
In one embodiment, the voice application can be divided into three portions: a first portion, which is to be executed on the client, and second and third portions, which are to be executed on the server. The second portion comprises code that interacts directly with the first portion, whereas the third portion comprises code that does not interact directly with the first portion. The first portion is transformed from the high-level procedural language into the voice mark-up language supported by the client (typically VoiceXML). The second portion is modified to associate data objects in a first format on the server with data objects in a second format on the client. The third portion is generally not subject to modification (although may be slightly adapted, for example to conform to the particular web server configuration employed).
In this particular embodiment, the annotations are used to explicitly identify functions that belong to the second portion of the voice application. If we regard the application functions as being arranged in an invocation hierarchy, then those above the annotated functions belong to the third portion (since the application must commence on the server), while those below the annotated functions belong to the first portion. More specifically, the latter can be identified automatically by determining the transitive closure of functions called by the annotated functions. It will be appreciated that by only having to identify the subset of functions that actually transfer control from the server to the client, the annotation task is considerably simplified.
The modification of methods in the second portion in this embodiment involves the replacement of the function by a corresponding proxy that is used to store information specifying the associations between data objects on the server and data objects on the client. (This may also involve a suitable adaptation of the invoking code in the third portion to call the proxy, rather than the original function). These associations are important to ensure that the server and client code behave as a coherent unit. In particular, at run-time the proxy translates the arguments of its corresponding function into code in the voice mark-up language, and then transfers the code to the client for execution. This facility represents one mechanism for handling dynamic data objects (i.e. those that are only specified at run-time), and so greatly extends the range of applications that can be developed using the above approach.
In another embodiment of the invention there is provided a method of developing a voice application for a client-server environment. Typically the server supports a high-level procedural language in which data objects have a first format, while the client supports a voice mark-up language in which data objects have a second format. The method begins with writing the voice application in a high-level procedural language. A part of the voice application is to be executed on the server, and a part of the voice application is to be executed on the client. The voice application is then compiled, and the part of the voice application that is to be executed on the client platform is transformed into the voice mark-up language, while the part of the voice application that is to be executed on the server is modified in order to associate data objects in the first format on the server with data objects in the second format on the client.
This approach allows all the application code to be written in a single high level procedural language. In one embodiment this language is Java (or more specifically a subset of Java), but other languages such as C, C++ and so on could be used instead. This has the advantage of generally being a much more familiar programming environment than a standard voice mark-up language, so it is easier for users to attract and retain developers and support staff with the requisite experience. In addition, there are a wide variety of standard development, profiling and debugging tools available for use with languages such as Java.
In a preferred embodiment, the voice mark-up language is VoiceXML. This has the advantage of widespread industry support, but the disadvantage that it does not support certain common programming constructions. In order to get round this problem, conditional constructions and loop constructions in the high-level procedural language can be compiled into VoiceXML conditional subdialog calls and recursive VoiceXML subroutine calls, respectively.
The VoiceXML specification includes support for ECMAScript-compatible code. In one embodiment, functions that are to be executed on the client platform in the voice mark-up language and that do not directly call basic speech functions on the client are compiled into such ECMAScript- compatible code. The motivation for this is that the performance of ECMAScript code on most VoiceXML platforms tends to be better than general VoiceXML. However, the functions that directly call basic speech functions are retained in VoiceXML, since ECMAScript does not support this functionality.
In another embodiment of the invention there is provided a method of running a voice application for a client-server environment in which a server supports a high-level procedural language in which data objects have a first format, and a client supports a voice mark-up language in which data objects have a second format. The method starts with commencing the voice application on the server, and the voice application then runs on the server until processing is to be transferred to the client. At this point, code in the voice mark-up language is dynamically generated on the server from the voice application in a high-level procedural language. This dynamically generated code supports the transformation of server-side data objects from the first format into the second format. The dynamically generated code in the voice mark-up language is then rendered from the server to the client for execution on the client.
The voice application is normally commenced in response to a request from the client, which may be received over any appropriate communications facility. Note that the client request itself is typically generated in response to an incoming telephone call to the client.
In a preferred embodiment, the voice application comprises three portions: a first portion that is to be executed on the client, and which is transformed from the high-level procedural language into the voice mark-up language supported by the client; a second portion that is to be executed on the server to interact directly with the first portion, and which is modified to associate data objects in the first format on the server with data objects in the second format on the client; and a third portion that is to be executed on the server, but that does not interact directly with the first portion.
In this embodiment, the second portion is responsible for dynamically generating code on the server in the voice mark-up language, the dynamically generated code supporting the transformation of the at least one data object from said first format into said second format. This dynamically generated code is then combined with said first portion for rendering from the server to the client for execution on the client. Note that the first portion of the code may itself be dynamically generated at run-time, but in one preferred embodiment is previously generated through a compilation process. This improves performance by avoiding the need to have to generate the first portion of code in the voice mark-up language each time the voice application is run.
In one preferred embodiment, the second portion maintains a table indicating the association between data objects on the server in the first format and data objects on the client in the second format. This supports the interchange of objects between the server and client platforms. For example, a data object may be transmitted from the server to the client, updated on the client while in the second format, and then returned from the client to the server. In this case the data object received back from the client (in the second format) is transformed back into the first format. Using the table, the updated version of the object received from the client can be matched to the original version on the server, and this original version then updated accordingly.
Another embodiment of the invention provides apparatus for developing a voice application for a client-server environment, in which a server supports a high-level procedural language in which data objects have a first format, and a client supports a voice mark-up language in which data objects have a second format. The voice application is written in a high-level procedural language, and accompanied by one or more annotations indicative of which part of the voice application is to be executed on the server, and which part of the voice application is to be executed on the client. The apparatus comprises: means for transforming the part of the voice application to be performed on the client from the high-level procedural language into the voice mark-up language supported by the client in accordance with the annotations; and means for modifying the part of the voice application that is to be executed on the server in order to associate data objects in the first format on the server with data objects in the second format on the client.
Another embodiment of the invention provides apparatus for developing a voice application for a client-server environment, in which a server supports a high-level procedural language in which data objects have a first format, and a client supports a voice mark-up language in which data objects have a second format. The voice application is written in a high-level procedural language. The apparatus comprises a compiler for performing a compilation process on the voice application and includes: means for transforming the part of the voice application that is to be executed on the client platform into the voice mark-up language as part of the compilation process; and means for modifying the part of the voice application that is to be executed on the server in order to associate data objects in the first format on the server with data objects in the second format on the client as part of the compilation process.
Another embodiment of the invention provides a server for running a voice application in a client-server environment, in which the server supports a high-level procedural language in which data objects have a first format and a client supports a voice mark-up language in which data objects have a second format. The server includes an application server system for launching the voice application (which includes at least one data object in the first format). The voice application then runs on the server until processing is to be transferred to the client. The server further includes a dynamic compiler for generating code on the server in said voice mark-up language from the voice application, wherein the dynamically generated code supports the transformation of the data object from the first format into the second format, and a communications facility for rendering the dynamically generated code in the voice mark-up language from the server to the client for execution on the client.
Another embodiment of the invention provides a computer program for use in developing a voice application for a client-server environment, in which a server supports a high-level procedural language in which data objects have a first format, and a client supports a voice mark-up language in which data objects have a second format. The voice application is written in a high-level procedural language and accompanied by one or more annotations indicative of which part of the voice application is to be executed on the server, and which part of the voice application is to be executed on the client. The program comprises instructions to perform the steps of: transforming the part of the voice application to be performed on the client from the high-level procedural language into the voice mark- up language supported by the client in accordance with the annotations; and modifying the part of the voice application that is to be executed on the server in order to associate data objects in the first format on the server with data objects in the second format on the client.
Another embodiment of the invention provides a compiler for developing a voice application for a client-server environment, in which a server supports a high-level procedural language in which data objects have a first format, and a client supports a voice mark-up language in which data objects have a second format. The voice application is written in a high-level procedural language, and the compiler includes program instructions for performing a compilation process on the voice application. The compilation process includes: transforming the part of the voice application that is to be executed on the client platform into the voice mark-up language; and modifying the part of the voice application that is to be executed on the server in order to associate data objects in the first format on the server with data objects in the second format on the client.
Another embodiment of the invention provides a computer program providing a platform for running a voice application in a client-server environment, in which a server supports a high-level procedural language in which data objects have a first format, and a client supports a voice mark-up language in which data objects have a second format. The program includes instructions for commencing the voice application (which involves at least one data object in the first format) on the server, and running the voice application on the server until processing is to be transferred to the client. At this point the program instructions support the dynamic generation of code on the server in the voice mark-up language, wherein the dynamically generated code supports the transformation of the data object from the first format into the second format. The program instructions then render the dynamically generated code in the voice mark-up language from the server to the client for execution on the client.
The above computer programs are executed on one or more machines. Note that the development system may well not be the same as the run-time client-server system. In this context, execution includes interpreting, such as for some Java code, or rendering, such as for mark-up languages. The computer programs may be preinstalled on disk storage of the relevant machines, or supplied as a computer program product. Such program product, which typically comprises the program instructions stored in/on a medium, may be downloaded over a network (such as the Internet), or supplied as a physical storage device, such as a CD ROM. For execution, the program instructions are usually first copied into main memory (RAM) of the machine, and then executed by the processor(s) of the machine. In addition, the program instructions are normally copied and saved onto disk storage for the machine, and then are executed from this disk storage version, although they may also be executed directly from the CD ROM, etc. It will be appreciated that the apparatus and computer program/computer program product embodiments of the invention will generally benefit from the same preferred features as described above with reference to method embodiments of the invention.
In summary therefore, the above approach greatly simplifies the process of developing and deploying interactive spoken dialogue applications implemented in a voice mark-up language, such as dynamic VoiceXML, for use in a client-server environment. A compiler, a run-time environment, and other associated software modules are provided, which in the preferred embodiment enable specification of the application in a subset of Java equipped with some speech utility classes. The application program in Java can then be compiled into either server-side or client-side code as desired. More particularly, the compilation process results in a mixture of Java and VoiceXML, with the Java running on the server side and the VoiceXML on the client side. The distribution of code between the client side and the server side is controlled by a set of annotations, as set out in a file or other suitable facility. Note that for debugging purposes, the voice application can also be compiled to run in a Java-only environment (typically on a single system).
Brief Description of the Drawings
Various embodiments of the invention will now be described in detail by way of example only with reference to the following drawings in which like reference numerals pertain to like elements and in which:
Figure 1 is a schematic illustration of the general use of a voice browser; Figure 2 is a schematic diagram illustrating the main components in the voice application development system of the present invention; Figure 3 is a simplified schematic diagram illustrating the main components involved in running a voice application in a single processor environment;
Figure 4 is a flowchart illustrating the compilation of a voice application into dynamic VoiceXML;
Figure 5 is a simplified schematic diagram illustrating the main components involved in running a voice application in a dynamic client-server VoiceXML environment;
Figure 6 is a flowchart illustrating the steps performed in running a voice application in a dynamic client-server VoiceXML environment;
Figures 7A-7K illustrate the communications between the components of Figure 5 in performing the steps of Figure 6; and Figure 8 is a flowchart providing an overview of the voice application development process.
Detailed Description
Figure 2 depicts the main components of the voice application development environment as disclosed herein. It will be appreciated that the underlying motivation of this environment is to generally allow a voice application to be efficiently developed for use in a configuration such as shown in Figure 1.
The main components illustrated in Figure 2 are: (1) The SpeechJava language definition 10, including a utility class SpeechIO, which carries out basic speech input and output operations;
(2) A single processor runtime environment 20, including a suitable implementation of the SpeechIO class, which enables execution of SpeechJava applications as normal Java programs;
(3) A "static compiler" 30, which converts SpeechJava programs into equivalent programs represented as single VoiceXML pages;
(4) A "dynamic compiler" 40, which converts a SpeechJava program, together with a small set of annotations, into a dynamic VoiceXML program comprising a standard Java program and a collection of static VoiceXML pages. The annotations specify which parts of the application are to be run on the server, and which parts on the client; and (5) A dynamic VoiceXML runtime environment 50, implemented on top of the standard Tomcat gateway or a similar piece of software, which enables execution of code generated by the dynamic compiler. (Tomcat is a freeware server technology available from http://jakarta.apache.org/).
Note that the dynamic VoiceXML environment 50 depends on the dynamic SpeechJava to VoiceXML compiler 40, which in turn depends on the static SpeechJava to VoiceXML compiler 30. In contrast, the single-processor environment depends only on the SpeechJava language definition 10.
Each of the components from Figure 2 will now be described in more detail. This will be done in two stages: firstly at a relatively high level, to provide an understanding of the overall system; and secondly at a lower level, to provide an understanding of the sort of implementation issues that arise for a particular language environment. These two stages will then be followed by an example of a SpeechJava program being transformed into VoiceXML.
1.1 The SpeechJava Language Definition
SpeechJava represents a subset of the Java programming language, and in one embodiment contains at least the following constructs:
1. Class definitions;
2. Definitions of static methods; 3. Variable declarations for variables of the following types: a. Primitive data-types; b. Class members; c. Arrays of primitive data-types or class members; 4. Method invocations; 5. Assignment statements; 6. Return statements: both simple, and returning a value;
7. Conditionals either of the form 'if-then' or of the form 'if-then-else';
8. Iterative loops of the forms 'while', 'do ... while' or 'for';
9. Switch statements; 10. Definitions of inner classes. These classes can contain definitions of data members, but nothing else. 11. At least the following types of expressions: a. Arithmetic expressions; b. Relational expressions; c. Array element expressions; d. 'new' expressions.
The above operations provide a C-like procedural subset of Java. Thus in a general sense static methods can be used as though they were C-style functions, and inner classes as though they were C-style structs. Note that additional Java language commands and facilities may also be supported within the SpeechJava subset, as discussed in more detail below. This will allow a future extension to full object-oriented programming, without compromising upwards compatibility.
In addition, the SpeechJava language definition incorporates a newly defined utility class ("SpeechIO") that provides low-level speech functions for application input and output. The SpeechIO class contains methods for at least the following operations: a. speech recognition using a specified grammar; b. speech output using a recorded wavfile; c. speech output using a text-to-speech engine. A further output utility class ("RecStructure") is also provided that represents the result of performing a speech recognition operation (this can be incorporated into the SpeechIO class if desired).
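By way of illustration only, a trivial SpeechJava fragment using these utility classes might look as follows. The method name, the grammar name "yes_no", the slot name "answer" and the wavfile path are all hypothetical, and the SpeechIO method signatures follow the definitions given in section 2.1 below; this is a sketch rather than part of the preferred embodiment.

// Illustrative sketch only: a minimal SpeechJava method using the SpeechIO
// utility class. The grammar "yes_no", the slot "answer" and the wavfile
// path are invented for the example.
public class ConfirmExample {
    public static void askConfirmation() {
        SpeechIO.sayWavfile("prompts/please_confirm.wav");    // output from a recorded wavfile
        RecStructure result = SpeechIO.Recognise("yes_no");   // recognition using a named grammar
        String answer = result.getSlotValueAsString("answer");
        if (answer == null) {
            SpeechIO.sayTTS("Sorry, I did not catch that.");  // output via text-to-speech
        } else {
            SpeechIO.sayTTS("You said " + answer);
        }
    }
}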
1.2 A Single-Processor Environment for SpeechJava
SpeechJava programs can be developed and run in any normal Java environment equipped with an implementation of the SpeechIO class. One easy way to do this is to implement a server which can carry out the basic speech input and output operations, and then realise the SpeechIO class as a client to this server (note that the client and server do not necessarily have to be on the same system, for example if remote method invocation is used). The RecStructure class can be implemented as an extension of Hashtable or some similar class, and may be provided as an inner class for example of the SpeechIO class.
This is illustrated in Figure 3, where a SpeechJava application 301 runs in Java environment 305. The SpeechJava application 301 calls methods in the SpeechIO class 302 in order to perform voice input/output operations. (Figure 3 does not show the RecStructure class separately, rather this is treated as a component of the SpeechIO class 302).
The SpeechIO class 302 can be regarded in effect as a wrapper for functionality provided by the underlying voice platform 303. This platform is typically outside the Java environment 305, but can be accessed by suitable native language calls from the SpeechIO class 302. It will be appreciated of course that Figure 3 represents a standard architecture for writing conventional (non Web-based) voice applications in Java.
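By way of illustration, the RecStructure class mentioned above might be realised as a Hashtable extension along the following lines. This is a minimal sketch only: the use of Integer wrappers for int values and the choice of a sentinel value for a missing int slot are assumptions made for the example rather than features of the preferred embodiment.

import java.util.Hashtable;

// Minimal sketch: RecStructure realised as an extension of Hashtable,
// associating int or String values with named slots (see section 2.1).
public class RecStructure extends Hashtable {
    public String getSlotValueAsString(String slotName) {
        Object value = get(slotName);
        if (value instanceof String) {
            return (String) value;
        }
        return null;
    }

    public int getSlotValueAsInt(String slotName) {
        Object value = get(slotName);
        if (value instanceof Integer) {
            return ((Integer) value).intValue();
        }
        return -1;   // assumption: sentinel used here when the slot has no int value
    }

    public String getSlotType(String slotName) {
        Object value = get(slotName);
        if (value instanceof String) {
            return "String";
        }
        if (value instanceof Integer) {
            return "int";
        }
        return null;
    }
}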
1.3 A SpeechJava to Static VoiceXML Compiler
This compiler converts SpeechJava programs into equivalent static VoiceXML programs. The basic idea is to realise definitions of static SpeechJava methods as VoiceXML <form> elements, invocations of static SpeechJava methods as VoiceXML <subdialog> elements, and instances of SpeechJava inner classes as Javascript (ECMAScript) objects. Most of this process can be carried out using standard techniques, as described in more detail below.
However, particular problems arise due to the following two (somewhat surprising) limitations of the VoiceXML language: 1. It is not straightforward to translate a SpeechJava procedure call inside an 'if-then' or 'if-then-else' construction into VoiceXML. This is due to the fact that VoiceXML <subdialog> elements can only be inserted into <form> elements and not into <if> elements.
2. There is no straightforward way to implement a VoiceXML subdialog definition containing an iterative loop. Normally, a loop would be implemented using some kind of 'goto' construction at the bottom of the loop to jump back up to the beginning. However, executing the VoiceXML <goto> element after a subdialog call loses the calling context, and makes it impossible to execute a subsequent <return>.
The first limitation can be overcome by noting that although VoiceXML syntax does not permit insertion of a <subdialog> element into an <if> element, it is nevertheless possible to give a
<subdialog> element a 'cond' attribute. This then makes execution of the call conditional on the value that the attribute evaluates to at run-time.
It is therefore possible to recode an arbitrary conditional from Java into the terms of VoiceXML conditional subdialog calls by introducing a new variable whose value is set depending on the result of the conditional test. Two new subdialogs are then defined (called 'sub_if' and 'sub_else' for example). These encode the 'if' and 'else' branches of the conditional respectively. Calls to both sub_if and sub_else can then be incorporated into the VoiceXML code, each being conditional on an appropriate value of the branch variable that corresponds to the original conditional test in the Java code. This basic picture is complicated by three further considerations: a. In general, the conditional will occur in a context where local variables are defined, and these variables may be referenced in the body of the conditional. It is thus necessary to pass the local variables into the new subdialogs sub_if and sub_else, return their new values on exiting the subdialogs, and use the returned values to update the local variables in the translated conditional. b. It may also happen that one or both of the branches of the conditional contains an occurrence of 'return'. This implies that the subdialogs sub_if and sub_else also need to return information indicating whether exit was due: (i) to reaching the end of the subdialog, or (ii) to processing a 'return' (we will refer to this latter possibility as a "non-local return"). If exit was due to processing a non-local return, then the calling subdialog also needs to return. c. Since conditionals can be nested, the type of transformation described here will in general be applied recursively. For this reason, non-local returns in the sense of (b) above need to percolate up the calling chain until they reach the level of the original top-level function. Return from this level is then treated as a normal return.
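To make the recoding concrete, the following Java-level sketch shows the shape of the transformation for a simple conditional. It is purely illustrative: the compiler actually emits VoiceXML (see section 2.3.8 below), and the example bodies, variable names and method names are invented for the example.

// Illustrative sketch only: the original conditional
//     if (x > 0) { x = x + 1; } else { x = x - 1; }
// recoded with a branch variable and two new "subdialog" methods, each call
// guarded by the branch variable in the same way that a VoiceXML <subdialog>
// element can be guarded by a 'cond' attribute.
public class ConditionalRecoding {
    static int sub_if(int x)   { return x + 1; }    // encodes the 'if' branch
    static int sub_else(int x) { return x - 1; }    // encodes the 'else' branch

    static int recoded(int x) {
        int branch_choice;
        if (x > 0) {
            branch_choice = 1;      // corresponds to 'first_branch'
        } else {
            branch_choice = 2;      // corresponds to 'second_branch'
        }
        if (branch_choice == 1) {   // becomes a 'cond' attribute in VoiceXML
            x = sub_if(x);
        }
        if (branch_choice == 2) {
            x = sub_else(x);
        }
        return x;
    }
}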
Consider now the second limitation mentioned above, concerning loops. Although it is not generally possible to code an iterative loop as such in VoiceXML, it is however possible to realise it as a recursive subroutine call. This can be done using techniques similar to those just described for handling conditionals. Thus the condition and body of the loop are translated into a new subdialog (called 'sub_loop', for example) that has the following sequential structure:
1. Test the condition, and return if it is false.
2. Execute the body of the loop.
3. Recursively call sub_loop.
Note that there are issues with respect to local variables, return statements and nested subdialogs that are similar to those described above for conditional statements. In particular, these are addressed by ensuring that non-local returns (as previously defined) are passed up the chain of subdialog invocations. (This chain may include subdialogs derived from both conditionals and iterative loops).
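The corresponding loop-to-recursion rewriting can similarly be pictured in Java terms. The sketch below is illustrative only (the actual output is the VoiceXML given in section 2.3.10 below), and the example loop, its local variables and the name sub_loop are chosen purely for the example.

// Illustrative sketch only: a 'while' loop
//     while (i < n) { total = total + i; i = i + 1; }
// rewritten as a recursive call, mirroring the sub_loop subdialog described
// above. The local variables are passed in and the updated values returned.
public class LoopRecoding {
    static int[] sub_loop(int i, int n, int total) {
        if (!(i < n)) {
            return new int[] { i, total };    // condition false: return the current locals
        }
        total = total + i;                    // body of the loop
        i = i + 1;
        return sub_loop(i, n, total);         // recursive call replaces the loop back to the top
    }
}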
A full compilation of SpeechJava into static VoiceXML is normally only possible for code that can be executed entirely on the client (this happens to be the case for the example application given in section 3 below). More generally, when at least some portions of the original Java code are to be run on the server, either for reasons of efficiency or because some server processing is necessary in the application, then a dynamic compiler must be used, as will now be described.
1.4 The Compilation of SpeechJava into Dynamic VoiceXML
Compiler 40 transforms annotated SpeechJava programs into Java-based dynamic VoiceXML programs. The resulting code from the compilation process comprises a Java program to be executed on the server, together with a set of one or more pieces of VoiceXML to be executed on the client. Annotations are utilised to control which Java methods are to be executed on the server (hence remaining as Java), and which are to be executed on the client (hence being compiled into VoiceXML). In particular, a developer can use the annotations to identify those methods that are intended to transfer processing from the server to the client.
Note that the compiler knows that the SpeechIO methods must be implemented on the client (since only this has speech facilities), and so will automatically convert these into the appropriate VoiceXML, with the necessary transfer of control from the server. The annotations therefore allow a developer to specify additional processing to be performed on the client. If no annotations are provided, then only the basic minimum of speech operations will be performed on the client.
The compilation of SpeechJava into dynamic VoiceXML is illustrated in Figure 4. This process exploits the static compiler described in the preceding subsection, which turned SpeechJava into static VoiceXML. The procedure of Figure 4 is as follows:
1. Internalise the source code and the annotations (step 410). This transforms the code into representations of flow, method calls, and so on that are easier to work with. Note that internalisation per se is well-known in the art. Indeed, in the preferred embodiment, the internalisation is performed using a publicly available piece of freeware, namely the ANTLR parser-generator, together with the accompanying grammar for Java (see www.antlr.org).
2. Construct a call graph, whose nodes are methods and whose arcs are invocations of a method by another method (step 420).
3. Use the call graph and the annotations (step 430) to separate the method declarations into three groups: a. Normal server side methods. These will stay as Java. b. Client side methods. These will become VoiceXML. c. Server side methods that transfer control to the client side. These will be replaced by special
"proxy" methods, as described below in (5). Note that this transfer can either be made explicit (via the annotations), or implicit (if the program calls one of the three SpeechIO methods listed above).
The separation into the three groups is performed based on the knowledge that the program starts on the server side, and remains there until a transfer is encountered. Such a transfer can either be explicitly identified in the annotations file, or else implicit (thus a call to a method in the SpeechIO class must cause a transfer to the client, since only the client supports the necessary audio input/output facilities). Note that each transfer is effective for the duration of a single called client method, after which processing is returned back to the server via a "submit" operation.
4. For each method in group (c) above, use the call graph to compute the transitive closure of the method under the invocation relation (step 440). Thus although each call to the client comprises a single method, this called method may in turn invoke further methods that are also performed on the client (in the same manner as a conventional function or program stack). Once these further methods have completed, we return to the originally called client method, and then back to the server. This set of methods comprises the transitive closure of the called method. Each method in the set is therefore translated into VoiceXML using the SpeechJava to static VoiceXML compiler, as previously described.
5. For each method in group (c) above, a corresponding "proxy" method is created (step 450) that performs the following actions: a. Call a utility method to translate the arguments of the group (c) method into a piece of VoiceXML code, and store a table associating server side objects with client side objects. b. Combine this piece of VoiceXML code with the VoiceXML code compiled in (4). c. Render out the combined VoiceXML to the client. d. Wait for a new submit from the client containing the returned information. e. Decode the returned information, and if necessary update server-side objects and/or compute a return value.
6. For each method in group (a) above modify as follows (step 460): a. Replace calls to methods in group (c) above with calls to the corresponding proxy methods. b. Replace calls to SpeechIO primitives with calls to the corresponding methods that render out client-side code.
7. Finally, the server side units are converted back to external Java syntax (step 470).
Note that for reasons of efficiency, most of the dynamic compilation process is normally completed in advance, rather than being left to run-time. However, this is not possible for the VoiceXML code directly associated with the transfer of objects between the server and the client. Rather, this VoiceXML code is generated at run-time in response to particular client requests received by the server, and is then combined with the previously compiled code for rendering out to the client (as described in step 5 above).
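As an informal illustration of step 5 above, a generated proxy method follows a pattern along the following lines. All of the names below (VoiceXMLTranslator, Gateway, getANumber and so on) are invented for the example; this is a hedged sketch of the pattern, not the actual generated code.

// Hypothetical sketch of a "proxy" method for a client-side method
// int getANumber(String prompt). The helper types are assumptions.
interface VoiceXMLTranslator {
    String argumentsToVoiceXML(Object[] args);       // step (a): also records object correspondences
    String precompiledCodeFor(String methodName);    // step (b): statically compiled VoiceXML
    int decodeIntResult(String submittedVoiceXML);   // step (e): map returned objects back to Java
}

interface Gateway {
    void render(String voiceXmlPage);                // step (c): render the page out to the client
    String waitForSubmit();                          // step (d): block until the client submits
}

class GetANumberProxy {
    static int getANumber(String prompt, VoiceXMLTranslator translator, Gateway gateway) {
        String argVxml = translator.argumentsToVoiceXML(new Object[] { prompt });
        String page = argVxml + translator.precompiledCodeFor("getANumber");
        gateway.render(page);
        String submitted = gateway.waitForSubmit();
        return translator.decodeIntResult(submitted);
    }
}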
1.5 The Execution of Compiled SpeechJava in a Dynamic VoiceXML Environment
Figure 5 illustrates the components of the client/server environment for execution of the compiled SpeechJava voice application programs in dynamic VoiceXML mode. Note that this environment includes both client-side and server-side processing. Figure 5 illustrates a server platform 110 and a client platform 130 (mirroring the arrangement of Figure 1). The client and server are connected by a link 501 that supports HTTP communications. Link 501 may therefore be implemented over the Internet if so desired. It will be appreciated that any other appropriate communications facility or protocol could be utilised instead, provided they were appropriately supported by the various components.
In one embodiment, both the client and server platforms each comprise standard desktop computers running the Windows NT operating system (Windows 2000), available from Microsoft Corporation. The server side software 502 includes the Tomcat server implementation previously mentioned, as available from http://jakarta.apache.org, which creates new Tomcat servlets in response to incoming requests from clients. In addition, the server includes the following main components:
1. The server side Java code 510 produced by the dynamic VoiceXML compiler. Note that this is running on a Java virtual machine 505 (i.e. a Java run-time environment).
2. The (pre-compiled) VoiceXML code 520, as produced by the dynamic VoiceXML compiler. 3. An HTTP gateway process 540. This incorporates a Gateway Server 541, a Gateway Client
542, and a Tomcat Servlet 543, whose operation is described in more detail in section 2.5 below. 4. A utility class 530 that is responsible for translating at run-time between server-side (Java) and client-side (VoiceXML) representations. For this purpose, utility class 530 maintains a correspondence table 535, which stores associations between objects on the client and on the server. Note that this utility class 530 may if desired be split into multiple classes, and in some embodiments may be incorporated into the gateway process 540.
The client system 130 includes a standard VoiceXML browser 550, which communicates with the server-side gateway process via HTTP. The VoiceXML browser 550 is used to render the VoiceXML code 551, which is downloaded from the server. In one embodiment, in a development environment, browser 550 is implemented by the V-Builder program, available from Nuance Corporation (www.nuance.com). In this environment, client system 130 is then provided with a SoundBlaster audio card to provide audio input/output. The production version of this embodiment (which unlike the development environment is telephony enabled and supports multiple sessions) utilises the Nuance Voice Webserver program to provide the VoiceXML browser 550, which renders the VoiceXML code 551. Client side system 506 then incorporates suitable telephony interface software and hardware (not shown in Figure 5), as supported by the Nuance Voice Web Server product. (See http://www.nuance.com/products/voicexml.html for more details of these various Nuance products). Of course, it will be appreciated that one of the advantages of developing VoiceXML applications is that they are not limited to any particular combination of hardware and software, but rather can be used with any platform that supports the VoiceXML standard.
The execution of the dynamic VoiceXML client-server application is illustrated in the flowchart of Figure 6. The communications between the various components shown in Figure 5 are illustrated in the series of Figures 7A-K (for clarity, certain components from Figure 5 have been omitted from Figure 7).
Figure 6 therefore illustrates the following steps:
1. The client begins by sending a 'run' request 710 to the gateway process 540 on the server over link 501 (step 605, Figure 7A). This request specifies the name of a particular desired application, and will normally be generated in response to an incoming call to client system 130. Typically this will be a conventional telephone call over a land or mobile network, although some clients may support Voice over Internet Protocol (VoIP) calls, or some other form of audio communication. In addition, client system 130 may send a run request 710 as part of outbound call processing (as well as or instead of inbound call processing). In known fashion, the particular application request sent to the server 110 may be dependent on one or more parameters such as the called number, the calling number, the time of day, and so on.
2. The gateway process 540 communicates with the server side software 502 in order to start executing the named application program as a new thread 510 (step 610, Figure 7B). It will be appreciated that the processing so far conforms (per se) to existing VoiceXML applications.
3. The program runs until it encounters a proxy method in the Java application 510 on the server
(step 615, Figure 7C). As previously explained, a proxy method is one that results in a call to be performed on the client. (Note that if there are no such proxy methods encountered, the processing effectively goes straight to step 670, described below).
4. The proxy method in the Java application 510 calls a utility method from the translation utility class 530 to translate the arguments 720 of the proxy method into a piece of VoiceXML code 730, which is then returned back to the proxy method (step 620, Figure 7D). The utility method also stores a table 535 that contains the associations between server side objects in the Java application 510 and the corresponding client side objects in the returned VoiceXML code 730.
5. The proxy method combines the newly generated VoiceXML code 730 received back from the utility class with the appropriate portion 735 of the pre-compiled VoiceXML code 520. This latter component represents a translation of the set of client methods called from this particular proxy method (step 625, Figure 7E). It will be understood therefore that much of the VoiceXML code for execution on the client can be determined (statically) in advance from the original Java code. However, some of the client VoiceXML code can only be generated dynamically at run-time, based for example on the particular request from the client or on particular information in a database.
6. The VoiceXML code 740, comprising the combination produced in the preceding step of the statically prepared VoiceXML code 735 with the dynamically created VoiceXML code 730, is passed to the HTTP gateway process 540, which then renders it out over communications link 501 to the client 130 (step 630, Figure 7F). (VoiceXML code 740 as downloaded to the client in Figure 7F corresponds to VoiceXML code 551 as illustrated in Figure 5).
7. The VoiceXML browser 550 receives the VoiceXML code 740 from the server 110, and starts to execute this received VoiceXML code 740 (step 640, Figure 7G). Meanwhile, the server-side process thread in the Java application 510 suspends.
8. The VoiceXML browser 550 on the client executes the received VoiceXML code 740 until it returns to the top level call of this code. By construction, the VoiceXML code forming this top level call concludes with a <submit> element. This triggers a return to the server-side gateway process 540, passing back a return value or values 750 if appropriate (step 645, Figure 7H). This allows, for example, spoken information recognised during the call to be submitted back to the Java application 510 for processing on the server side.
9. Having received the return from the client system, the HTTP gateway process 540 on the server side 110 wakes up the relevant thread in Java application 510 (this thread having been suspended as part of step 640). The gateway process 540 then passes the reawakened server thread the return information received from the client as part of the submit process of the preceding step (step 650, Figure 7I). Note that at this stage the returned information is still in the form of VoiceXML objects 760 (i.e. as received from the VoiceXML code on the client 130).
10. The thread in the Java application 510 that received the VoiceXML objects 760 from the gateway process calls a method (or methods) in the translation utility class 530 to translate these objects back into Java (step 655, Figure 7J). In order to achieve this, the VoiceXML objects submitted back from the client are passed to the translation utility class 530, which uses the table created in step 620 above to decode the contents of these objects. This content 770 can then be returned to the Java application 510 by updating existing server-side objects, creating new server-side objects, or creating a call return value (or some combination of these three actions).
11. The proxy method is now able to continue processing, and to eventually complete and return. This leads to the resumption of normal server-side processing of Java application 510 (step 660, Figure 7K).
12. The continued processing of the relevant thread of Java application 510 on the server 110 may lead to one or more further proxy methods being called, if there are more portions of code to be run on client 130. If this turns out to be the case (step 665), then processing returns to repeat stages (3) through (11), as just described. 13. Finally all the proxy methods in the relevant Java application 510 have been completed, and so the server-side program is ready to terminate. At this point, there is an outstanding HTTP request from the client (given the request/response model of client-server HTTP communications). Thus the gateway process 540 renders out a piece of null VoiceXML to the client 130 in order to formally satisfy this remaining request (step 670). This then allows this thread of the server Java application 510 to conclude. Likewise, the client 130 may conclude the call, or may perform additional processing associated with the call that does not involve server 110.
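One simple way to picture the suspend and wake-up handshake of steps 7 to 9 is as a blocking hand-off between the server-side application thread and the gateway process. The following wait/notify sketch is illustrative only, and is not intended to describe how the Tomcat-based gateway of the preferred embodiment is actually implemented.

// Illustrative sketch only: the application thread hands a VoiceXML page to
// the gateway and suspends; the gateway renders the page, waits for the
// client's <submit>, and then wakes the application thread with the values.
class RenderExchange {
    private String pageForClient;
    private String submittedValues;

    // Application thread: hand over the page and block until the submit arrives.
    synchronized String renderAndWait(String voiceXmlPage) throws InterruptedException {
        pageForClient = voiceXmlPage;
        notifyAll();                      // let the gateway pick the page up
        while (submittedValues == null) {
            wait();                       // the application thread "suspends" here
        }
        String result = submittedValues;
        submittedValues = null;
        return result;
    }

    // Gateway: fetch the page to be rendered out to the client.
    synchronized String takePageForClient() throws InterruptedException {
        while (pageForClient == null) {
            wait();
        }
        String page = pageForClient;
        pageForClient = null;
        return page;
    }

    // Gateway: the client's submit has arrived; wake the application thread.
    synchronized void deliverSubmit(String values) {
        submittedValues = values;
        notifyAll();
    }
}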
1.6 Overview of Voice Application Development Process
Figure 8 provides an overview of the voice application development process in this embodiment. This commences with writing the application in the SpeechJava format (step 810). Once the application has been written, it can be tested on a single processor Java platform provided with suitable audio facilities (step 820). This testing is useful to verify correct application behaviour, although it can be omitted if desired.
The next step in Figure 8 is to compile the application into static VoiceXML (step 830). Note that this is only feasible if the entire application can potentially be run on the client. Thus once the complete application has been statically compiled into VoiceXML, it is possible to test the VoiceXML code on the client (step 840). This can be useful for understanding application behaviour and performance. Note however that the static compilation and associated testing can be omitted if desired; indeed, they must be omitted if the application includes operations that can only be performed on the server, since in this case the static compilation will not be viable.
The method now proceeds to step 850, where the annotations are developed to control whether processing is to be performed on the server or on the client. As previously mentioned, the annotations can be considered as optional, in that if no annotations are provided, the system defaults to performing only the basic speech input/output operations on the client. The application can now be compiled into dynamic VoiceXML (step 860), using the annotations to control the location of the relevant processing, and finally installed onto the server system, ready for use (step 870).
It will be appreciated that if during subsequent testing or deployment it is desired to alter the balance of processing between the server and the client, it is not necessary to alter the Java application itself (step 810). Instead, only the annotation file needs to be updated (step 850).
Note that the compilation of step 860 could potentially be postponed until run-time (i.e. after the installation of step 870). This would allow the annotation file to be specified at run-time (perhaps for example dependent on the type of client that initiated the request to the server). However, in the preferred embodiment the compilation is done in advance. This avoids the repetition of having to perform substantially the same compilation process for each user request. Nevertheless, it is not possible to create all the VoiceXML code in advance; rather the generation of a certain proportion of the VoiceXML code, connected with data object transfer between the server and client, must be deferred until run-time (see step 620 of Figure 6).
One possibility is to utilise parallel versions of the same application, but compiled with different annotation files. It will be appreciated that this is a simpler maintenance task than having to support multiple versions of the same underlying application.
A single server can then deploy these different versions, and determine which one to utilise in response to a given client request. There are various criteria that may be used in making this determination. For example, the decision may be based on the current loading of the server and/or network (if heavy, this would suggest pushing more processing onto the client). Likewise, the identity of the client may also be a factor. Thus many client IVR systems are physically located adjacent to the server, and hence would expect to have a fast communications link with the server (this may favour more processing on the server). It may also be the case that certain types of clients may not support all VoiceXML features. The skilled person will be aware of other potential criteria to use in this determination (clearly the various factors can be combined if desired).
In other circumstances different versions of an application (i.e. as compiled with different annotation files) can be installed on different server systems, according to the configuration, deployment, usage patterns and so on for each particular server.
There now follows specific implementation details of one preferred embodiment of the present invention, concentrating in particular on the SpeechJava language, the compilers, and the runtime environment. The organisation corresponds to that previously adopted in that Section 2.1 presents one preferred definition of the SpeechJava language; Section 2.2 presents a single-processor environment for running SpeechJava programs; Section 2.3 describes in detail one particular SpeechJava to static VoiceXML compiler; Section 2.4 describes in detail one particular SpeechJava to dynamic VoiceXML compiler; and Section 2.5 describes in detail one particular environment for the execution of SpeechJava programs that have been compiled into dynamic VoiceXML.
2.1 Details of a SpeechJava Language Definition
In one embodiment, SpeechJava is a subset of Java equipped with an extra utility class called SpeechIO. Static methods are used in effect as though they were C-style functions, and inner classes as though they were C-style structs. Input and output are handled through the SpeechIO class. In this particular embodiment SpeechJava contains the following constructs: 1. Definitions of top-level classes, of the form
class (<ClassName> <Body>)
where <ClassName> is an identifier and <Body> is a list of definitions of static methods and/or inner classes.
2. Definitions of inner classes. These classes can only contain definitions of data members.
3. A utility class called RecStructure extending Hashtable, which is intended to represent the results of calling recognition. A RecStructure object associates int or String values with a set of String slots. It contains the following methods: a) String getSlotValueAsString (String slotName) Returns the value of a String-valued slot, or null if it has no value.
b) int getSlotValueAsInt (String slotName) Returns the value of an int-valued slot, or null if it has no value.
c) String getSlotType (String slotName) Returns either "String" or "int" depending on the type of the slot's value, or null if it has no value.
4. A utility class for low-level speech functions called SpeechIO, containing the following methods: a) RecStructure Recognise (String grammar)
Perform speech recognition using the named grammar.
b) void sayWavfile (String Pathname) Perform speech output using the named recorded wavfile.
c) void sayTTS (String StringToSpeak)
Perform speech output on the specified string using a text-to-speech engine.
5. The following types: a) int, String and RecStructure. b ) Names of inner classes. c ) One-dimensional arrays of elements of types (a) or (b).
6. Definitions of static methods, of the form <Accessibility> static <Type> <MethodName> (<FArgs>) where <Accessibility> is one of {private , public , protected} , <Type> is a type listed in (5), and <FArgs> is a list of formal arguments.
7. Variable declarations for variables of the types listed in (5) above.
8. Method invocations of the form
<MethodName> ( <Args>) where <MethodName> is the name of a method, and <Args> is a list of arguments.
9. Assignment statements.
10. Return statements: both simple, and returning a value.
11. Conditionals, either of the form 'if-then' or of the form 'if-then-else'.
12. Iterative loops of the forms 'while', 'do ... while' or 'for'.
13. Switch statements.
14. In addition to those already mentioned, the following types of expressions: a) Arithmetic expressions; b) Relational expressions; c) Array element expressions; d) 'new' expressions.
2.2 Details of a Single-Processor SpeechJava Environment
SpeechJava programs conforming to the above description can be developed and run in any normal Java environment equipped with an implementation of the SpeechIO and RecStructure classes. One easy way to implement the RecStructure class (which can potentially be incorporated into the SpeechIO class) is as an extension of Hashtable. The SpeechIO class can be realised by first implementing a server (corresponding to the voice platform 303 of Figure 3) that can carry out the basic speech input and output operations. This server can be built on top of any standard speech recognition platform, for example a toolkit from Nuance Corporation (see above). This server provides a minimal speech application, whose top-level loop reads messages from an input channel and acts on them in an appropriate way. Messages will be of the following three types:
1) Play audio file.
2) Perform TTS on specified string.
3) Perform recognition using specified grammar.
In the third case, the server sends a return message over an output channel containing either the recognition result, or a notification that recognition failed for some reason. The SpeechIO class itself (302 in Figure 3) can be implemented as a client to the server described above. Each of the three SpeechIO methods then functions by sending an appropriate message to the server, and in the case of the Recognise message also waiting for a return value.
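A skeletal SpeechIO client along these lines is sketched below. The Channel interface, the textual message format and the parsing of the reply are all assumptions made purely for the example; the actual transport between the SpeechIO class and the speech server is implementation-specific.

// Illustrative sketch only: SpeechIO realised as a client that sends the
// three message types listed above to the speech server, and waits for a
// reply in the recognition case. RecStructure is as described in section 2.1.
interface Channel {
    void send(String message);
    String receive();                              // blocks until the server replies
}

class SpeechIO {
    private static Channel channel;                // connection to the speech server

    static void sayWavfile(String pathname) {
        channel.send("PLAY " + pathname);          // message type 1: play audio file
    }

    static void sayTTS(String stringToSpeak) {
        channel.send("TTS " + stringToSpeak);      // message type 2: perform TTS on a string
    }

    static RecStructure Recognise(String grammar) {
        channel.send("RECOGNISE " + grammar);      // message type 3: recognise using a grammar
        String reply = channel.receive();          // recognition result or failure notification
        return parseReply(reply);
    }

    private static RecStructure parseReply(String reply) {
        RecStructure result = new RecStructure();
        // fill the slots of 'result' from the reply; omitted in this sketch
        return result;
    }
}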
2.3 Details of a SpeechJava to Static VoiceXML Compiler
This section describes one embodiment of a compiler that converts annotated SpeechJava programs conforming to the framework of Section 2.1 above into equivalent VoiceXML programs. The annotations for indicating whether code is to be performed on the server or on the client are provided as a separate input file. The compiler performs this conversion as a sequence of three main steps:
1. Parsing of the source Java code and the annotations file into internal format.
2. Compilation of the internalised Java code into abstract (internalised) VoiceXML.
3. Rendering of abstract VoiceXML into executable VoiceXML.
The first stage is performed in one embodiment using known methods. Thus in the compiler of this particular embodiment, the parsing of the Java code to internal form is done using a version of the freeware ANTLR parser-generator, together with the accompanying freeware ANTLR grammar for Java ('www.antlr.org'). As regards the annotations file, this has a simple structure and can be parsed using simple ad hoc methods.
The third stage above renders abstract VoiceXML into executable VoiceXML. Thus the compiled VoiceXML is initially in abstract or internalised form for easier manipulation (similar to the internalised form of Java produced at step 410 in Figure 4). It will be appreciated that techniques for converting back from this abstract form into a final executable form are well-known in the art, and accordingly will not be described in further detail herein.
For the remainder of this section, we will now focus on the second processing stage, which is structured as a syntax-directed translation algorithm. For each of the SpeechJava constructs listed in Section 2.1 above, we describe how that construct is translated into abstract VoiceXML, and provide examples.
The greater part of the algorithm is encoded as two recursive functions, which respectively translate statements and expressions. We will call these 'translate_statement' and 'translate_expression'. Translate_statement takes the following arguments:
1. The input Java fragment;
2. The current list of local variables, including formal parameters;
3. The current list of ancillary <form> elements resulting from translation of conditional and loop constructs. Translate_statement then returns the following outputs:
1. The output abstract VoiceXML fragment;
2. The list of local variables, including formal parameters, after translating the fragment;
3. The list of ancillary <form> elements resulting from translation of conditional and loop constructs, after translating the fragment.
Since VoiceXML, in contrast to Java, does not permit expressions to contain subdialog calls, Java expressions in general translate into two components: a VoiceXML expression, and a list of zero or more statements which are executed before the statement in which the target expression appears. If this list is non-empty, this is generally because it contains a <subdialog> item. Translate_expression consequently takes the following arguments:
1. The input Java fragment;
2. The current list of extra statements resulting from translation of method invocations inside the expression; 3. The current list of local variables, including formal parameters;
4. The current list of ancillary <form> elements resulting from translation of conditional and loop constructs.
Translate_expression then returns the following outputs: 1. The output abstract VoiceXML fragment;
2. The list of extra statements resulting from translation of method invocations inside the expression, after translating the fragment;
3. The list of local variables, including formal parameters, after translating the fragment;
4. The list of ancillary <form> elements resulting from translation of conditional and loop constructs, after translating the fragment.
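The overall shape of these two functions can be sketched in Java as follows. The result containers are placeholders introduced for the example, and the input and output fragments are represented simply as Object values standing in for the internalised Java and abstract VoiceXML representations; the actual translation rules are those given in subsections 2.3.1 to 2.3.23 below.

import java.util.List;

// Illustrative skeleton only: the argument and result lists mirror the inputs
// and outputs enumerated above for translate_statement and translate_expression.
class Translator {
    static class StatementResult {
        Object output;            // the translated abstract VoiceXML fragment
        List localVariables;      // local variables (including formal parameters) after translation
        List ancillaryForms;      // ancillary <form> elements from conditionals and loops
    }

    static class ExpressionResult extends StatementResult {
        List extraStatements;     // statements hoisted out of the expression (e.g. <subdialog> items)
    }

    StatementResult translateStatement(Object statement, List localVariables, List ancillaryForms) {
        // dispatch on the statement type: declaration, assignment, return,
        // conditional, iterative loop, switch, method invocation, ...
        throw new UnsupportedOperationException("sketch only");
    }

    ExpressionResult translateExpression(Object expression, List extraStatements,
                                         List localVariables, List ancillaryForms) {
        // dispatch on the expression type: arithmetic, relational, array element,
        // 'new', method invocation, ...
        throw new UnsupportedOperationException("sketch only");
    }
}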
2.3.1 Grammar declarations
Each recognition grammar used as an argument to an invocation of SpeechIO.recognise is the subject of a declaration using a line of the form:
grammar grammar {slot_1, slot_2, ..., slot_n}
where grammar is the name of the grammar as it appears in the invocation of SpeechIO.recognise, and slot_1, ... slot_n are the names of the slots filled by grammar.
Each grammar declaration is translated into a <form> item corresponding to a recognition call involving the defined grammar. The declaration above translates into a <form> with the following appearance:
<form id="recognition_subdialog_for_grammar">
  <var name="rec_status" expr="'ok'"/>
  <grammar src="grammars/all_grammars.gsl#grammar"/>
  <catch event="nomatch">
    <assign name="rec_status" expr="'nomatch'"/>
    <assign name="slot_1" expr="'undefined'"/>
    ...
    <assign name="slot_n" expr="'undefined'"/>
    <return namelist="rec_status slot_1 ... slot_n"/>
  </catch>
  <catch event="noinput">
    <assign name="rec_status" expr="'noinput'"/>
    <assign name="slot_1" expr="'undefined'"/>
    ...
    <assign name="slot_n" expr="'undefined'"/>
    <return namelist="rec_status slot_1 ... slot_n"/>
  </catch>
  <field modal="false" hotword="false" name="slot_1"/>
  ...
  <field modal="false" hotword="false" name="slot_n"/>
  <filled mode="any">
    <if cond="( ( typeof value ) == 'undefined' )">
      <assign name="value" expr="'undefined'"/>
    </if>
    <return namelist="rec_status slot_1 ... slot_n"/>
  </filled>
</form>
The intent is that invocations of the form SpeechIO.recognise(grammar) will be translated into <subdialog> items for recognition_subdialog_for_grammar. The object returned will have n+1 properties named rec_status, slot_1 ... slot_n. The value of rec_status will be "ok" if recognition succeeded, "noinput" if no spoken input was detected, and "nomatch" if spoken input was detected but recognition failed. If rec_status is "ok", then the properties slot_1 ... slot_n will be filled with the values of the grammar slots slot_1 ... slot_n, where these exist, or else by "undefined". If rec_status is not "ok", then all of slot_1 ... slot_n will be filled with "undefined".
2.3.2 Inner class
In this embodiment definitions of inner classes are internalised and stored for use in translating other constituents, but are not directly translated into output VoiceXML.
2.3.3 Static method definition
A static method definition is translated into one or more <form> elements, as follows:
1. The formal parameters are translated into <param> elements as described in the following subsection (2.3.4).
2. The body of the method is translated into abstract VoiceXML, potentially including one or more ancillary <form> elements resulting from translation of conditional or loop elements. The list of current local variables is initialised to the list of formal parameters.
3. If necessary, a "return type declaration" is created of the form:
<var name="return_type" expr="'none'"/>
4. If necessary, a "return value declaration" is created of the form:
<var name="return_value"/>
5. The translated method definition is a <form> element of one of the following patterns: a. If this is the main method, then the translation is:
<form id="function_name">
  (translation of formal parameters)
  (translation of body)
</form>
b. Otherwise, if this is a method with a void return value, the translation is:
<form id="function_name">
  (translation of formal parameters)
  (return type declaration)
  (translation of body)
  <return/>
</form>
c. Otherwise, the translation is:
<form id="function_name">
  (translation of formal parameters)
  (return type declaration)
  (return value declaration)
  (translation of body)
</form>
2.3.4 Formal parameters
Each formal parameter parameter_name is translated into a <var> item of the form:
<var name="parameter_name"/>
2.3.5 SpeechIO methods
The SpeechIO methods recognise, sayWavfile and sayTTS are translated as follows:
recognise
A method invocation of the form:
SpeechIO.recognise(grammar)
is translated into a VoiceXML fragment of the form:
<subdialog src="#recognition_subdialog_for_grammar" name="subdialog_1"/>
where subdialog_1 is a new subdialog identifier.
sayWavfile
A method invocation of the form:
SpeechIO.sayWavfile(wav_file_string)
is translated into a VoiceXML fragment of the form:
<audio src="wav_file_string"/>
sayTTS
A method invocation of the form:
SpeechIO.sayTTS(expression)
is translated into a VoiceXML fragment of the form:
<audio src="(translation of expression)"/>
2.3.6 Declarations
In this embodiment declarations are translated into <var> items. If initial values are supplied, these are either translated into 'expr' attributes, or into assignments on the newly defined variables. The following examples illustrate how the translation is carried out:
int i;
String[] prompts;
translate to:
<var name="i"/>
<var name="prompts"/>
and:
int i=0;
translates to:
<var name="i" expr="0"/>
2.3.7 Method invocations
In this embodiment method invocations are translated into <subdialog> items, but the form of the translation depends on whether the method invocation occurs in a 'statement' or an 'expression' context. If the method invocation appears as a statement, then it is directly translated as a <subdialog> item. If the method invocation however is part of an expression, it is translated in the output VoiceXML expression as the expression:
subdialog_1.return_value
where subdialog_1 is a newly generated identifier, and the <subdialog> item is added to the output list of 'extra statements' produced by the relevant call to translate_expression. In both cases, the list of actual parameters to the method invocation is translated using translate_expression, and the resulting output list of 'extra statements' is added to the current list of 'extra statements'.
If we write function for the function name, arg_1, ... arg_n for the formal parameters and val_1, ... val_n for the actual parameters, the output VoiceXML has the following form:
<subdialog src="#function" name="subdialog_1">
  <param name="arg_1" expr="val_1"/>
  ...
  <param name="arg_n" expr="val_n"/>
</subdialog>
2.3.8 Conditionals
VoiceXML syntax does not permit insertion of a <subdialog> element into an <if> element, but it is possible to give a <subdialog> element a 'cond' attribute which makes execution of the call conditional on the value that the attribute evaluates to at run-time. The compiler therefore translates Java conditionals by exploiting this fact. The general issues involved have already been described in Section 1.3 above.
The basic strategy is to define two new subdialogs (call them cond_sub_1 and cond_sub_2) that respectively encode the 'if' and 'else' branches of the conditional. The compiler then recodes the conditional in terms of conditional subdialog calls by introducing a new variable whose value is set depending on the result of the conditional's test. This is followed by calls to cond_sub_1 and cond_sub_2 conditional on appropriate values of the branch variable.
Considering this in more detail, suppose that the original conditional is:
if condition then body1 else body2
and it occurs in a context where the local variables defined are v_1 ... v_n. For body1, we define a <form> item cond_sub_1 (where cond_sub_1 is a new identifier) as follows:
<form id="cond_sub_1">
  <var name="v_1"/>
  ...
  <var name="v_n"/>
  <var name="return_type" expr="'none'"/>
  <var name="return_value"/>
  (translation of body1)
  <block>
    <return namelist="return_type v_1 ... v_n"/>
  </block>
</form>
where body1 is translated normally except that simple 'return' statements are translated as:
<assign name="return_type" expr="'no_value'"/>
<return namelist="return_type v_1 ... v_n"/>
and statements of the form 'return(return_expr)' are translated as:
<block>
  <assign name="return_type" expr="'value'"/>
  <assign name="return_value" expr="return_expr"/>
  <return namelist="return_type return_value v_1 ... v_n"/>
</block>
A similar <form> item cond_sub_2 is created to encode body2. Finally, the conditional expression itself is realised as a VoiceXML fragment of the following form, where branch_choice_1, subdialog_1 and subdialog_2 are suitable new constants:
<var name="branch_choice_1" expr="'none'"/>
<block>
  <if cond="( translation of condition )">
    <assign name="branch_choice_1" expr="'first_branch'"/>
  <else/>
    <assign name="branch_choice_1" expr="'second_branch'"/>
  </if>
</block>
<subdialog src="#cond_sub_1" name="subdialog_1" cond="( branch_choice_1 == 'first_branch' )">
  <param name="v_1" expr="v_1"/>
  ...
  <param name="v_n" expr="v_n"/>
</subdialog>
<block>
  <if cond="( branch_choice_1 == 'first_branch' )">
    <assign name="v_1" expr="subdialog_1.v_1"/>
    ...
    <assign name="v_n" expr="subdialog_1.v_n"/>
  </if>
  <if cond="( ( branch_choice_1 == 'first_branch' ) &amp;&amp; ( subdialog_1.return_type != 'none' ) )">
    <assign name="return_type" expr="subdialog_1.return_type"/>
    <if cond="( return_type == 'no_value' )">
      <return namelist="return_type"/>
    <else/>
      <assign name="return_value" expr="subdialog_1.return_value"/>
      <return namelist="return_type return_value"/>
    </if>
  </if>
</block>
<subdialog src="#cond_sub_2" name="subdialog_2" cond="( branch_choice_1 == 'second_branch' )">
  <param name="v_1" expr="v_1"/>
  ...
  <param name="v_n" expr="v_n"/>
</subdialog>
<block>
  <if cond="( branch_choice_1 == 'second_branch' )">
    <assign name="v_1" expr="subdialog_2.v_1"/>
    ...
    <assign name="v_n" expr="subdialog_2.v_n"/>
  </if>
  <if cond="( ( branch_choice_1 == 'second_branch' ) &amp;&amp; ( subdialog_2.return_type != 'none' ) )">
    <assign name="return_type" expr="subdialog_2.return_type"/>
    <if cond="( return_type == 'no_value' )">
      <return namelist="return_type"/>
    <else/>
      <assign name="return_value" expr="subdialog_2.return_value"/>
      <return namelist="return_type return_value"/>
    </if>
  </if>
</block>
2.3.9 Switch
Switch statements are simply translated into conditionals, following which the rules for translation of conditionals are applied.
2.3.10 While
Since it is not possible to encode iteration explicitly in VoiceXML, 'while' loops and other iterative constructs must be encoded using recursion. This is done using techniques similar to those described in subsection 2.3.8 above for the case of conditionals. The basic approach is to construct a new <form> element (call it while_sub_1) that encodes the body and test of the loop, and concludes by calling itself recursively; the actual loop can then be realised as a call to while_sub_1. 'Return' statements are translated specially in the manner described in subsection 2.3.8.
Suppose that the original loop is:
while (condition) body
and it occurs in a context where the local variables defined are v_1 ... v_n. We define a <form> item while_sub_1 (where while_sub_1 is a new identifier) as follows, where subdialog_1 is a suitable new constant:
<form id="while_sub_1">
  <var name="v_1"/>
  ...
  <var name="v_n"/>
  <var name="return_type" expr="'none'"/>
  <var name="return_value"/>
  <block>
    <if cond="!( translation of condition )">
      <return namelist="return_type v_1 ... v_n"/>
    </if>
  </block>
  (translation of body)
  <subdialog src="#while_sub_1" name="subdialog_1">
    <param name="v_1" expr="v_1"/>
    ...
    <param name="v_n" expr="v_n"/>
  </subdialog>
  <block>
    <assign name="v_1" expr="subdialog_1.v_1"/>
    ...
    <assign name="v_n" expr="subdialog_1.v_n"/>
    <if cond="( subdialog_1.return_type != 'none' )">
      <assign name="return_type" expr="subdialog_1.return_type"/>
      <if cond="( return_type == 'no_value' )">
        <return namelist="return_type v_1 ... v_n"/>
      <else/>
        <assign name="return_value" expr="subdialog_1.return_value"/>
        <return namelist="return_type return_value v_1 ... v_n"/>
      </if>
    </if>
    <return namelist="return_type v_1 ... v_n"/>
  </block>
</form>
The actual occurrence of the loop is translated as follows, where subdialog_2 is a suitable new constant:
<subdialog src="#while_sub_1" name="subdialog_2">
  <param name="v_1" expr="v_1"/>
  ...
  <param name="v_n" expr="v_n"/>
</subdialog>
<block>
  <assign name="v_1" expr="subdialog_2.v_1"/>
  ...
  <assign name="v_n" expr="subdialog_2.v_n"/>
  <if cond="( subdialog_2.return_type != 'none' )">
    <assign name="return_type" expr="subdialog_2.return_type"/>
    <if cond="( return_type == 'no_value' )">
      <return namelist="return_type"/>
    <else/>
      <assign name="return_value" expr="subdialog_2.return_value"/>
      <return namelist="return_type return_value"/>
    </if>
  </if>
</block>
2.3.11 Do - while
'Do - while' loops are translated similarly to 'while' loops, except that the conditional return occurs after the body rather than before.
2.3.12 For
'For' loops are rewritten into 'while' loops in the standard way and translated using the methods described in subsection 2.3.10 above.
2.3.13 Assignments
Assignment statements are translated as appropriate either into VoiceXML <assign> elements, or into ECMAScript (JavaScript) assign statements wrapped in <script> elements. The <assign> element is used if the left-hand side of the assignment is a simple variable, and the <script> element otherwise. For example, the Java assignments:
i = i + 1;
and
table[i].count = table[i].count + 1;
are translated respectively into:
<assign name="i" expr="( i + 1 )"/>
and
<script>
<![CDATA[
table[i].count = ( table[i].count + 1 );
]]>
</script>
2.3.14 Return
The Java return statement is by default translated by the VoiceXML <return> tag. When the 'return' expression occurs in the body of a conditional or an iterative loop, the translation of 'return' is treated specially in the manner described above in subsection 2.3.8.
2.3.15 Variables
Java variables are translated into VoiceXML variables.
2.3.16 Literals
Java numerical literals are translated as VoiceXML numerical literals. Java string literals are translated as VoiceXML string literals. The special Java constant 'null' is translated as the VoiceXML string "undefined".
2.3.17 Data member expressions
Java data member expressions of the form: class_instance.data_member_name are translated as ECMAScript object property references of the form: class_instance.data_member_name
2.3.18 Array element expressions
Java array element expressions of the form: array_instance[index] are translated as ECMAScript array element expressions of the form: array_instance[index]
2.3.19 Arithmetic expressions
The Java arithmetic operators '+', '-', '*', '/' and '%' are translated into the ECMAScript arithmetic operators of the same names.
2.3.20 String operator expressions
The Java string concatenation operator '+' is translated into the ECMAScript operator of the same name.
2.3.21 Relational expressions
The Java relational operators and connectors are translated into the corresponding ECMAScript relational operators and connectors.
2.3.22 'New' expressions
Invocations of the Java 'new' operator translate to the following sequence of VoiceXML statements:
1. Definition of a unique temporary variable (call it tmp_var), to hold the newly created object.
2. An assign statement that initialises tmp_var to the result of using the ECMAScript 'new' operator to create a new ECMAScript array.
3. Two more assign statements to associate suitable values in tmp_var with the two reserved keys 'structure_type' and 'identity_tag'; the issues are explained in more detail below.
4. If any init values are supplied, further assign statements to translate these values.
The following example illustrates the code produced from a 'new' expression. The original Java expression is:
new String[] {"foo", "bar"}
creating a new String array with two elements initialised to "foo" and "bar". This results in the VoiceXML code:
<var name="tmp_var_1"/>
<block>
<script>
<![CDATA[
tmp_var_1 = new Array ()
tmp_var_1.structure_type = '2:String'
tmp_var_1.identity_tag = new_identity_tag()
tmp_var_1[0] = 'foo'
tmp_var_1[1] = 'bar'
]]>
</script>
</block>
Here, tmp_var_1 is the new temporary variable; the value associated with the structure_type key encodes the information that tmp_var_1 is an array of two objects of type String; and the value associated with the identity_tag key is a unique new tag.
2.3.23 Translating into ECMAScript
Most commercially available VoiceXML interpreters are fairly inefficient. However, ECMAScript (more commonly known as JavaScript; see http://www.ecma.ch/ecmal/STAND/ECMA-262.HTM for the formal specification) is officially included as a part of the VoiceXML standard, and embedded ECMAScript code tends to execute much more quickly than the corresponding plain VoiceXML. It is consequently desirable where possible to translate SpeechJava code into ECMAScript code, rather than into VoiceXML code. The basic criterion here is that a SpeechJava method cannot be translated into ECMAScript if it includes calls to SpeechIO primitives, since speech operations can only be carried out within a VoiceXML form, and JavaScript functions have no mechanism for calling VoiceXML forms.
In view of the above, the compiler carries out a static analysis of the input SpeechJava program to determine those method definitions that will be performed on the client side and that can be compiled into ECMAScript. In order to do this, the compiler first constructs a call graph, in which the client side methods are nodes and method invocations are arcs. Methods are then labelled as being either 'VoiceXML' or 'ECMAScript', as follows:
1) Initially, all methods are unlabelled.
2) All methods that call SpeechIO primitives are labelled 'VoiceXML'.
3) All methods that call methods already labelled as 'VoiceXML' are also labelled 'VoiceXML'.
4) Step (3) is repeated until a fixed point is reached. 5) All remaining methods are labelled 'ECMAScript'.
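The labelling procedure of steps 1 to 5 is a simple fixed-point computation over the call graph. The following sketch is illustrative only; the representation of the graph as maps and sets of method names is an assumption made for the example, not part of the preferred embodiment.

import java.util.HashMap;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Map;
import java.util.Set;

// Illustrative sketch of the labelling described above. 'speechIoCallers' is
// the set of methods that call SpeechIO primitives directly; 'callersOf' maps
// each method name to the set of methods that invoke it.
class EcmaScriptLabelling {
    static Map label(Set allMethods, Set speechIoCallers, Map callersOf) {
        Map labels = new HashMap();                   // method name -> "VoiceXML" or "ECMAScript"
        Set worklist = new HashSet(speechIoCallers);  // step 2: direct SpeechIO callers
        while (!worklist.isEmpty()) {                 // steps 3 and 4: propagate to a fixed point
            Iterator it = worklist.iterator();
            String method = (String) it.next();
            it.remove();
            if ("VoiceXML".equals(labels.get(method))) {
                continue;                             // already labelled
            }
            labels.put(method, "VoiceXML");
            Set callers = (Set) callersOf.get(method);
            if (callers != null) {
                worklist.addAll(callers);             // callers of a VoiceXML method are also VoiceXML
            }
        }
        Iterator all = allMethods.iterator();         // step 5: everything else is ECMAScript
        while (all.hasNext()) {
            String method = (String) all.next();
            if (!labels.containsKey(method)) {
                labels.put(method, "ECMAScript");
            }
        }
        return labels;
    }
}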
Methods labelled 'VoiceXML' are translated as described earlier in this section, and the remaining methods can be translated into ECMAScript. This translation of Java into ECMAScript is reasonably straightforward (compared to the translation into VoiceXML per se), and in broad terms involves the translation of Java static methods into ECMAScript function definitions. The skilled person will then be able to map Java control primitives and operators into ECMAScript control primitives and operators without undue difficulty. In addition, Java data structures can be mapped into the same ECMAScript data structures as in the VoiceXML case.
A table is utilised to keep track of whether each method is realised as VoiceXML code or
ECMAScript, so that method invocations can be correctly translated as <subdialog> elements or ECMAScript function calls respectively.
2.4 Details of a Compilation of SpeechJava into Dynamic VoiceXML
This section describes in detail one embodiment of a compiler for implementing the method sketched in Section 1.4 above in order to transform annotated SpeechJava programs into Java-based dynamic VoiceXML programs. The annotations specify which Java methods are to be executed on the server (hence remaining as Java), and which are to be executed on the client (hence being compiled in VoiceXML). The convention used is that the annotations explicitly specify a set of zero or more methods that are to transfer processing to the client. The transitive closure of this set is then run on the client.
Annotations are supplied in a separate file, where each line has the format:
execute_on_client class.method/arity
Here, class is the name of a class, and method is the name of a method in that class with arity arguments. For example, a typical line might be:
execute_on_client get_a_number1.hello/0
The code produced by the compiler comprises a Java program to be executed on the server, together with a set of one or more pieces of VoiceXML to be executed on the client. The top-level steps carried out by the compiler have already been described in Section 1.4 above (see also Figure 4). We now discuss each of these steps in more detail for one particular embodiment.
2.4.1 Internalisation of the source code and annotations
The Java code is internalised using the ANTLR parser for Java referred to in Section 2.3 above, and the annotation file is internalised using straightforward ad hoc methods. Methods listed in the annotations file are marked in a table as execute_on_client methods; these are the methods that are to transfer processing to the client.
2.4.2 Construction of the call graph
The call graph is constructed by recursively traversing the internalised Java code and noting the following:
a) Instances of invocations of method M_1 inside method definition M_2, for some M_1, M_2. In this case, an arc of the form calls(M_2, M_1) is added to the call graph;
b) Instances of invocations of the form SpeechIO.recognise(G_1) inside method definition M_2, for some named grammar G_1. In this case, an arc of the form uses_grammar(M_2, G_1) is added to the call graph.
After the graph has been constructed, it is pruned by removing all arcs of the form calls(M_1, M_2) where M_2 is not a method defined in the source Java code.
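For illustration only, a call graph of this kind might be represented along the following lines; the CallGraph class and its method names are assumptions introduced here rather than part of the patent text. The prune method implements the arc-removal step just described.

import java.util.*;

// Hypothetical call-graph representation for the analysis described above.
class CallGraph {
    // calls(M_2, M_1): M_2 invokes M_1.
    final Map<String, Set<String>> calls = new HashMap<>();
    // uses_grammar(M_2, G_1): M_2 calls SpeechIO.recognise(G_1).
    final Map<String, Set<String>> usesGrammar = new HashMap<>();

    void addCall(String caller, String callee) {
        calls.computeIfAbsent(caller, k -> new HashSet<>()).add(callee);
    }

    void addGrammarUse(String caller, String grammar) {
        usesGrammar.computeIfAbsent(caller, k -> new HashSet<>()).add(grammar);
    }

    // Prune arcs whose target is not a method defined in the source code.
    void prune(Set<String> definedMethods) {
        for (Set<String> callees : calls.values()) {
            callees.retainAll(definedMethods);
        }
    }
}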
2.4.3 Use of the call graph to separate the methods into groups
The call graph and the execute_on_client table developed above are used to divide the set of methods into three groups. First define the relation server_side_call(M_1, M_2), so that server_side_call(M_1, M_2) holds iff call(M_1, M_2) holds and M_2 is not an execute_on_client method, and let server_side_call* be the reflexive and transitive closure of server_side_call. Then we divide up the set of methods as follows:
1. The server-side methods consist of the set of all methods M such that server_side_call*(main, M) holds, where 'main' represents the main method. These methods will stay as Java.
2. The execute_on_client methods comprise those methods that specifically transfer control from server to client, and must be treated specially (see below).
3. The client-side methods consist of the remaining methods. These methods are to be translated into VoiceXML.
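Purely as an illustrative sketch (reusing the hypothetical call-graph names introduced above, which do not appear in the patent), this division can be computed by walking the server_side_call* relation from 'main' and then taking set differences:

import java.util.*;

// Hypothetical sketch: divide methods into server-side, execute_on_client and client-side groups.
class MethodPartitioner {
    static Set<String> serverSide(String mainMethod,
                                  Map<String, Set<String>> calls,
                                  Set<String> executeOnClient) {
        // Reflexive and transitive closure of server_side_call, starting from 'main'.
        Set<String> reached = new HashSet<>();
        Deque<String> work = new ArrayDeque<>(List.of(mainMethod));
        while (!work.isEmpty()) {
            String m = work.remove();
            if (!reached.add(m)) continue;
            for (String callee : calls.getOrDefault(m, Set.of())) {
                // server_side_call excludes arcs into execute_on_client methods.
                if (!executeOnClient.contains(callee)) work.add(callee);
            }
        }
        return reached;   // these methods stay as Java
    }

    static Set<String> clientSide(Set<String> allMethods,
                                  Set<String> serverSide,
                                  Set<String> executeOnClient) {
        Set<String> rest = new HashSet<>(allMethods);
        rest.removeAll(serverSide);
        rest.removeAll(executeOnClient);
        return rest;      // these methods are to be translated into VoiceXML
    }
}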
2.4.4 Translation of the client-side code into VoiceXML
For each client-side method M, as defined in subsection 2.4.3 above, the call graph is used to compute the transitive closure of M as follows:
1. Find all methods M' such that call*(M, M') holds, where call* is the reflexive and transitive closure of the 'call' relation.
2. Find all grammar identifiers G such that uses_grammar(M', G) holds, where M' denotes the methods found in step 1.
For each M, this transitive closure is translated into VoiceXML using the SpeechJava to static VoiceXML compiler, using the appropriate declaration for each of the grammars G found in step 2. The result is written out to a file; call this file client_side_code(M).
2.4.5 Translation of execute_on_client methods
For each execute_on_client method M, a corresponding "proxy" method M' is created. Let the signature of the original method be:

return_type M(type_0 f_arg_0, type_1 f_arg_1, ... type_n f_arg_n)

where the return type of the method is return_type, the names of the arguments are f_arg_0 ... f_arg_n, and their types are type_0 ... type_n. As in subsection 2.4.4, let the name of the file that contains the client-side code for M be client_side_code(M).
For i = 0 ... n, define arg_i' as follows: if type_i is 'int', then arg_i' is the expression new Integer(arg_i); otherwise, arg_i' is arg_i.
The intent is that arg_i' should always be a proper object, as opposed to a primitive type.
The proxy method M' is then realised as follows:

return_type M(type_0 arg_0, type_1 arg_1, ... type_n arg_n) {
    String[] f_args = {"f_arg_0", "f_arg_1", ... "f_arg_n"};
    Object[] args = {arg_0', arg_1', ... arg_n'};
    gatewayServer.convertArgsToVXML("return_type", "M", f_args, args);
    return gatewayServer.sendVXML("return_type", "client_side_code(M)");
}

The motivations for this definition will be explained below, when the class gatewayServer and its associated methods are discussed. Briefly, the code performs the following functions:
1. gatewayServer.convertArgsToVXML takes as input the arguments of M, packaged as the Object array args, the names of the formal arguments, packaged as the String array f_args, the name of the return type, and the name of the method M itself. It uses this information to generate a piece of VoiceXML, which calls the code in client_side_code(M). In general, this involves creating client-side objects that correspond to the server-side objects in the arguments of M, so it also stores a table that associates each client-side object with its corresponding server-side object.
2. gatewayServer.sendVXML combines the VoiceXML code produced in step 1 with the precompiled client-side code in client_side_code(M), and renders this out to the client. It then waits for the client to perform a new submit which will contain the returned information. When this is received, it decodes it and if necessary updates server-side objects and/or computes a return value.
2.4.6 Transformation of the server-side code
For all the server-side methods, the following steps are performed:
1. Replace each call to an execute_on_client method with a call to the corresponding proxy method created as described in the preceding subsection 2.4.5.
2. Replace calls to SpeechIO primitives with calls to corresponding proxy methods.
With regard to step 2, a call to SpeechIO.sayTTS would for example be replaced by a call to the proxy method execute_client_sayTTS, defined as follows:
void execute_client_sayTTS(String arg_0) {
    String[] f_args = {"s"};
    Object[] args = {arg_0};
    gatewayServer.convertArgsToVXML("void", "sayTTS", f_args, args);
    gatewayServer.sendVXML("void", "sayTTS_1.vxml");
}

where the file sayTTS_1.vxml contains the following code:

<form id="sayTTS">
  <var name="s"/>
  <block>
    <value expr="s"/>
    <return/>
  </block>
</form>

2.4.7 Write out server-side code in standard Java syntax
In one embodiment, for certain reasons that are explained below in subsection 2.5.1, the server-side code is subjected to a final transformation, which for each top-level class class does the following:

1. class is made to extend the class 'GatewayRunnable'.

2. class is provided with a private data member called 'gatewayServer', as follows:

private GatewayServer gatewayServer = null;

3. class is provided with a public method called 'setSharedMemory', as follows:

public void setSharedMemory(SharedMemory newSharedMemory) {
    gatewayServer.setSharedMemory(newSharedMemory);
}

4. class is provided with a constructor with no arguments, as follows:

public class() {
    super();
    gatewayServer = new GatewayServer();
}

5. class is provided with a public method called 'run', as follows:

public void run() {
    main_proxy();
    gatewayServer.end();
}

6. The name of the 'main' method of class is changed to 'main_proxy'.

7. All methods for class are made public.
The server-side methods are then rendered out in standard Java syntax using a simple recursive descent algorithm.
2.5 Details of Execution of Compiled SpeechJava in a Dynamic VoiceXML Environment
This section provides details on how the high-level flow of control in the runtime environment described in Section 1.5 above (see also Figure 5) can be implemented in a manner consistent with the particular compiler embodiment of the preceding section, and explains the significance of the extra processing steps described in subsection 2.4.7.
The implementation includes a GatewayRunnable class and the Gateway. Server-side programs are classes that implement the GatewayRunnable interface. Communication between the voice-browser client and the server program first goes to a Tomcat Servlet (see http://jakarta.apache.org/, as previously mentioned); communication between the Tomcat Servlet 543 and the server-side program 510 is through the Gateway process 540 (see Figure 5). The top-level modules in the Gateway are the following:
The GatewayClient class.
The FileClassLoader class.
The GatewayServer class.
The SharedMemory class.

Execution starts when the servlet accepts a URL request to run a new program, specified as a string program. The servlet passes this request to the GatewayClient 542, which invokes the FileClassLoader to locate and run a class named program. Since this class extends GatewayRunnable, which in turn extends Runnable, program can be started as a separate thread.
Once it has been started, the instance of program communicates with the voice-browser client through its private instance of GatewayServer 542; this communicates with the GatewayClient 542, which in turn communicates with the Tomcat Servlet 543. The GatewayServer and GatewayClient pass information through an instance of the SharedMemory class.
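The patent does not give the signatures of these classes, so the following is only a rough sketch, under assumed shapes for GatewayRunnable, SharedMemory and FileClassLoader, of how the GatewayClient might load such a program and start it on its own thread.

// Minimal sketch only: GatewayRunnable, SharedMemory and FileClassLoader as shown
// here are placeholder assumptions, not the actual classes described in the patent.
class SharedMemory { /* placeholder for the real shared-memory channel */ }

class FileClassLoader extends ClassLoader { /* placeholder; loads classes from files */ }

interface GatewayRunnable extends Runnable {
    void setSharedMemory(SharedMemory sharedMemory);
}

class GatewayClientSketch {
    // Locate the compiled server-side program by name, hand it the SharedMemory
    // instance used to exchange data with the GatewayClient, and run it on its own thread.
    Thread startProgram(String programName, SharedMemory sharedMemory) throws Exception {
        Class<?> cls = new FileClassLoader().loadClass(programName);
        GatewayRunnable program = (GatewayRunnable) cls.getDeclaredConstructor().newInstance();
        program.setSharedMemory(sharedMemory);
        Thread thread = new Thread(program);
        thread.start();
        return thread;
    }
}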
The proximate interface between the instance of program and the Gateway consists of the two GatewayServer methods, convertArgsToVXML and sendVXML:
1. convertArgsToVXML takes the run-time information pertaining to the method invocation, and creates a small piece of VoiceXML that acts as a 'header' for the main piece of VoiceXML that has been produced and saved at compile-time. The server-side Java objects are translated into ECMAScript counterparts, and the correspondence between them is saved in a server-side table.
2. sendVXML combines the header information produced by convertArgsToVXML with the pre-compiled VoiceXML, and renders it out to the client. It then waits for the next client request, which should contain updated versions of the objects that have been sent from the server, possibly together with a return value. The server-side correspondence table is used to translate the data back into server- side form, update server-side objects where appropriate, and if necessary produce a server-side return value.
3. Examples
This section presents an illustrative example of a SpeechJava program and its translation into static and dynamic VoiceXML. The program itself is presented in Section 3.1, and the static VoiceXML translation in Section 3.2. Section 3.3 then presents a dynamic VoiceXML translation. Section 3.4 describes the run-time processing carried out by the dynamic VoiceXML program.
3.1 Example Program
The example program is a Java class called find_largest_number, which contains four static methods. The program uses text to speech (TTS) conversion to prompt the user for three numbers, finds the largest one, and speaks the result. The way in which this is done has been chosen to display many of the features of the SpeechJava language and its compilation into static and dynamic VoiceXML. The text of the program follows, with comments in italics:
// Import the RecStructure and SpeechIO packages
import com.netdecisions.cambridge.zozma.api.SpeechIO;
import com.netdecisions.cambridge.zozma.api.RecStructure;

class find_largest_number
{
    public static void main(String args[])
    {
        // The numbers to be input are stored in the array 'a'. The user is prompted using
        // the strings in the array 'prompts'
        int[] a = new int[3];
        String[] prompts = new String[]
            {"Say a number", "Say another number", "Say the last number"};
        SpeechIO.sayTTS("Hello");
        // The arrays 'a' and 'prompts' are passed to get_number_array.
        // On return, the intention is that 'a' will be suitably updated.
        // The third argument is the size of the array.
        get_number_array(a, prompts, 3);
        SpeechIO.sayTTS("The largest number is " + max_array(a, 3));
        SpeechIO.sayTTS("Goodbye");
    }

    private static void get_number_array(int[] numbers, String[] prompts, int size) {
        for ( int i = 0; i < size ; i++ ) {
            // For each element in the array, prompt to get the number and confirm
            numbers[i] = get_number(prompts[i]);
            SpeechIO.sayTTS("Component " + i + "I heard " + numbers[i]);
        }
        SpeechIO.sayTTS("Got all the numbers");
    }

    private static int max_array(int[] numbers, int size) {
        int max = numbers[0];
        for ( int i = 0; i < size ; i++ ) {
            if ( numbers[i] > max ) {
                max = numbers[i];
            }
        }
        return max;
    }

    private static int get_number(String prompt)
    {
        RecStructure rec_result;
        SpeechIO.sayTTS(prompt);
        // Call recognition using the grammar ".NUMBER" to get the number
        rec_result = SpeechIO.recognise(".NUMBER");
        // If recognition status is ok, return the 'value' slot, else recurse to try again
        if ( rec_result.getSlotValueAsString("rec_status").equals("ok") ) {
            return rec_result.getSlotValueAsInt("value");
        } else {
            SpeechIO.sayTTS("Sorry, I couldn't interpret that as a number.");
            return get_number(prompt);
        }
    }
}
3.2 Translation into static VoiceXML

This section presents the result of compiling the whole example program into static VoiceXML. A grammar declaration is provided containing the single line

grammar .NUMBER {value}

declaring that the grammar '.NUMBER' has one slot called 'value'.
The VoiceXML code produced is as follows, where once again comments have been added in italics:
Header information, generated for every program
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE vxml PUBLIC '-//Nuance/DTD VoiceXML 1.0//EN' 'http://voicexml.nuance.com/dtd/nuancevoicexml-l-2.dtd'>
<vxml version="1.0">

Definition of the global ECMAScript function new_identity_tag, used to assign a unique new tag to each object created. Also generated for every program.
<script>
<![CDATA[
var identity_tag_counter = 10000
function new_identity_tag() {
  return identity_tag_counter++
}
]]>
</script>
Translation of 'main' method.
<form id="main">
Translation of declaration of array 'a' (cf Sections 2.3.6 and 2.3.22). Note that the elements are initialised to the value 'undefined'.
<var name="a"/> <var name="tmp_var_1"/>
<block> <script> <![CDATA[
tmp_var_1 = new Array()
tmp_var_1.structure_type = '3:int'
tmp_var_1.identity_tag = new_identity_tag()
tmp_var_1[0] = 'undefined'
tmp_var_1[1] = 'undefined'
tmp_var_1[2] = 'undefined'
]]>
</script>
<assign name="a" expr="tmp_var_l "/> </block>
Similar translation of declaration of variable 'prompt '. <var name="prompts"/>
<var name="tmp_var_2"/> <block> <script> <![CDATA[ tmp_var_2 = new Array () tmp_var_2.structure_type = '3:string' tmp_var_2.identity_tag = new_identity_tag() tmp_var_2[0] = 'Say a number' tmp_var_2[l] = 'Say another number' tmp_var_2[2] = 'Say the last number'
]]> </script> <assign name- 'prompts" expr="rmp_var_2"/> Translation of first invocation of SpeechIO. sayTTS. <value expr='"Hello"7> </block> Translation of invocation ofgetjiumber irray
<subdialog src="#get_number_array"> <param name="numbers" expr="a"/> <param name="prompts" expr="prompts"/> <param name="size" expr="2"/> </subdialog>
<block>
Translation of second and third invocations of SpeechIO.sayTTS. Note that the call to max_array has been translated into a call to an ECMAScript function
<value expr="( 'The largest number is ' + es_fun_max_array( a, 3 ) )"/> <value expr="'Goodbye'"/>
</block> </form>
Translation of method 'get_number_array'
<form id="get_number_array">
<var name="numbers"/> <var name="prompts"/> <var name="size"/>
<var name="return_type" expr='"none"7> Translation of declaration for local variable 'i ', including initialisation to 0.
<var name="i"/> <block>
<assign name="i" expr="0"/> </block> Translation of for' loop as recursive invocation of form 'whilejsub ' (cf
Sections 2.3.10 and 2.3.12).
<subdialog src="#while_sub_l " name="subdialog_3"> <param name="i" expr="i"/> <param name="numbers" expr="numbers"/> <param name="prompts" expr="prompts"/>
<param name="size" expr="size"/> </subdialog> <block>
<assign name="i" expr="subdialog_3.i"/> <assign name="numbers" expr="subdialog_3. numbers "/>
<assign name="prompts" expr="subdialog_3. prompts "/> <assign name="size" expr="subdialog_3.size"/> <value expr='"Got all the numbers"7> <return > </block>
</form>
Definition of internally generated form 'while_sub_1', constituting the body of the for-loop. Note the inclusion of the local variable 'i' as a parameter (cf Section 2.3.10).
<form id="while_sub_1">
<var name="i"/> <var name="numbers"/> <var name="prompts"/> <var name="size"/> <block>
<ifcond="!( ( i &lt; size ) )">
<return namelist="i numbers prompts size"/> </if </block> <subdialog src="#get_number" name="subdialog_l "> <param name="prompt" expr="prompts[i]"/> </subdiaIog> <block> <script <![CDATA[ numbers[i] = subdialog_l.return_value ]]> </script> <value expr="( ( ( 'Component ' + i ) + T heard ' ) + numbers[i] )"/>
<assign name="i" expr="( i + 1 )"/> </block>
Recursively call 'while_sub_1', passing in the current value of all the local variables
<subdialog src="#while_sub_1" name="subdialog_2">
<param name="i" expr="i"/> <param name="numbers" expr="numbers"/> <param name="prompts" expr="prompts"/> <param name="size" expr="size"/> </subdialog>
Recover the values of all the local variables at the end of the call.
<block> <assign name="i" expr="subdialog_2.i"/> <assign name="numbers" expr="subdialog_2.numbers"/> <assign name="prompts" expr="subdialog_2.prompts"/>
<assign name="size" expr="subdialog_2.size"/> Pass out the values of the updated local variables. <return namelist="i numbers prompts size"/> </block> </form>
Translation of the method 'max_array'. Since this contains no explicit or implicit calls to speech primitives, it has been turned into an ECMAScript function for efficiency reasons (cf Section 2.3.23).
<script>
<![CDATA[
function es_fun_max_array( numbers, size ) {
  var max
  max = numbers[0]
  var i
  i = 0
  while ( ( i < size ) ) {
    if ( ( numbers[i] > max ) ) {
      max = numbers[i]
    }
    i++
  }
  return max
}
]]>
</script>
Translation of the method 'get_number'.
<form id="get_number"> <var name="prompt"/>
<var name="return_type" expr="'none'"/>
<var name="return_value"/>
<var name="rec_result"/>
<block> <value expr="prompt"/>
</block>
Call the generated form 'recognition_subdialog_for_NUMBER', which carries out recognition using the grammar '.NUMBER'. The result is passed back as the property 'return_value', and assigned to the local variable 'rec_result' (cf. Section 2.3.1).
<subdialog src="#recognition_subdialog_for_NUMBER" name="subdialog_4"/> <block>
<assign name="rec_result" expr="subdialog_4.return_value"/> </block> Define a variable to encode the choice of branch in the conditional (cf. Section
2.3.8).
<var name="branch_choice_ l " expr='"none"7> <block>
The first branch involves only items whose translation can be included in the scope of an <if> form, so their translations can be left in place.
<if cond="( rec_result.rec_status = 'ok' )"> <assign name="return_type" expr='"value'"/> <assign name="return_value" expr="rec_result.value"/> <return namelist="return_type return_value"/> The second branch involves a procedure call, so its translation is realised by setting the branch variable and acting on the setting outside the scope of the
<if>-
<else/> <assign name="branch_choice_l" expr='"second_branch"7> </if>
</block>
If the second branch was chosen, conditionally invoke the generated form 'cond_sub_1'. This once again involves passing local variables into the call and retrieving them afterwards.
<subdialog src="#cond_sub_1" name="subdialog_6" cond="( branch_choice_1 ==
'second_branch' )">
<param name="rec_result" expr="rec_result"/> <param name="prompt" expr="prompt"/> </subdialog> <block>
If we chose the second branch, retrieve local variables to get their new values.
<if cond="( branch_choice_1 == 'second_branch' )"> <assign name="rec_result" expr="subdialog_6.rec_result"/> <assign name="prompt" expr="subdialog_6.prompt"/> </if>
If we chose the second branch and there is a return value, retrieve it.
<if cond="( ( branch_choice_1 == 'second_branch' ) &amp;&amp; ( subdialog_6.return_type != 'none' ) )">
<assign name="return_type" expr="subdialog_6.return_type"/>
If there is no return value, just send back the return_type.
<if cond="( return_type == 'no_value' )">
<return namelist="return_type"/>
If there is a return_value, return both that and the return_type.
<else/> <assign name="return_value" expr="subdialog_6.return_value"/>
<return namelist="return_type return_value"/> </if> </if> <return/> </block>
</form>
Generated definition for the form 'cond_sub_1', used in 'get_number' above.
<form id="cond_sub_1"> <var name="rec_result"/>
<var name="prompt"/> <var name="return_type" expr="'none'"/> <var name="return_value"/> <block> <value expr="'Sorry, I couldn\'t interpret that as a number.'"/> </block>
<subdialog src="#get_number" name="subdialog_5">
<param name="prompt" expr="prompt"/> </subdialog> <block>
<assign name="return_type" expr='"value'"/>
<assign name- 'return value" expr="subdialog_5.return_value"/>
<return namelist="retura_type rec_result prompt"/> </block> </form>
Generated definition for recognition ubdialogJbrJfUMBER (cf Section 2.3.1). <form id="recognition_subdialog_for_NUMBER"> <var name="retura_value" expr="new Array()"/> <block>
<script> <![CDATA[ rerurn_value.rec_status = 'ok' rerurn_value.structure_type = 'feat_val_list' ]]>
</script> </block>
<grammar src="grammars/all_grammars.gsl#NUMBER"/> <catch event="nomatch"> <script>
<![CDATA[
return_value.rec_status = 'nomatch'
return_value.value = 'undefined'
]]> </script>
<return namelist="return_value"/> </catch>
<catch event="noinput"> <script> <![CDATA[ return_value.rec_status = 'noinput' return_value.value = 'undefined'
]]> </script> <return namelist="return_value"/>
</catch>
<field modal="false" hotword="false" name="value"/> <filled mode="any"> <if cond- '( ( typeof value ) == 'undefined' )"> <script> <! [CDATA[ return_value.value = 'undefined' ]]> </script>
<else/>
<script> <! [CDATA[ return_yalue.value = value ]]> </script> </if>
<return namelist="return_value"/> </filled>
</form>
</vxml>
3.3 Translation into dynamic VoiceXML
This section presents one possible result of compiling the example program into dynamic VoiceXML. The associated annotations file contains the one line:

execute_on_client find_largest_number.get_number_array/3

This declares that the method get_number_array with three arguments is to be executed on the client side.
As in Section 3.2, there is also a grammar declaration to the effect that the grammar '.NUMBER' has one slot called 'value':

grammar .NUMBER {value}

(Note that in one embodiment the grammar declarations and annotations are both included in the same compiler input file.)
The output consists of a Java file, representing the code to be run on the server, and a VoiceXML file, representing the pre-compiled portion of the code to be run on the client. The Java file is as follows, with comments as before in italics:
// Import the gateway package
import com.netdecisions.cambridge.zozma.gateway.*;

// The class implements GatewayRunnable (cf Sections 2.4.7 and 2.5)
public class find_largest_number implements GatewayRunnable {

    // Standard introductory material (cf Sections 2.4.7 and 2.5)
    private GatewayServer gatewayServer = null;

    public find_largest_number() {
        super();
        gatewayServer = new GatewayServer();
    }

    public void setSharedMemory(SharedMemory newSharedMemory) {
        gatewayServer.setSharedMemory(newSharedMemory);
    }

    public void run() {
        main_proxy();
        gatewayServer.end();
    }

    // Translated main method (cf Section 2.4.7)
    public void main_proxy() {
        int[] a = new int[3];
        String[] prompts = new String[] {"Say a number", "Say another number", "Say the last number"};
        // Call to "proxy" method for SpeechIO.sayTTS (cf Section 2.4.6)
        execute_client_sayTTS_1("Hello");
        // Call to "proxy" method for get_number_array (cf Sections 2.4.5 and 2.4.6)
        execute_client_get_number_array_3(a, prompts, 3);
        execute_client_sayTTS_1(("The largest number is " + max_array(a, 3)));
        execute_client_sayTTS_1("Goodbye");
    }

    // Method max_array stays the same, since it is run on the server side and
    // involves no calls to speech primitives.
    public int max_array(int[] numbers, int size) {
        int max = numbers[0];
        for ( int i = 0 ; ( i < size ) ; i++ ) {
            if ( ( numbers[i] > max ) ) {
                max = numbers[i];
            }
        }
        return max;
    }

    // "Proxy" method for get_number_array (cf Sections 2.4.5 and 2.4.6)
    public void execute_client_get_number_array_3(int[] arg_0, String[] arg_1, int arg_2) {
        String[] f_args = {"numbers", "prompts", "size"};
        Object[] args = {arg_0, arg_1, new Integer(arg_2)};
        gatewayServer.convertArgsToVXML("void", "get_number_array", f_args, args);
        gatewayServer.sendVXML("void", "get_number_array_3.vxml");
    }

    // "Proxy" method for SpeechIO.sayTTS (cf Section 2.4.6)
    public void execute_client_sayTTS_1(String arg_0) {
        String[] f_args = {"s"};
        Object[] args = {arg_0};
        gatewayServer.convertArgsToVXML("void", "sayTTS", f_args, args);
        gatewayServer.sendVXML("void", "sayTTS_1.vxml");
    }
}
The VoiceXML file get_number_array_3.vxml comprises the relevant subset of the file produced by the static VoiceXML compiler, presented in Section 3.2 above. Specifically, it contains the definitions of the five forms 'get_number_array', 'while_sub_1', 'get_number', 'cond_sub_1' and 'recognition_subdialog_for_NUMBER'.
3.4 Running the dynamic VoiceXML program
This section describes the resulting behaviour when the dynamic VoiceXML code from
Section 3.3 is executed in the runtime environment of Section 2.5. In order to focus on communications between server-side and client-side processes, the initial steps (all of which can be considered as routine) are omitted. Thus we move directly to the invocation in execute_client_get_number_array_3 of the method convertArgsToVXML, at which point we have the following local variables:

arg_0: an int array with three uninitialised elements.
arg_1: a String array with three elements, whose values are "Say a number", "Say another number" and "Say the last number".
arg_2: an int with value 3.
f_args: a String array whose three elements have the values "numbers", "prompts" and "size".
args: an Object array whose three elements have the values arg_0, arg_1 and Integer(3).
The invocation is:

gatewayServer.convertArgsToVXML("void", "get_number_array", f_args, args)

The purpose of this call is to produce a "header" piece of VoiceXML that makes a <subdialog> invocation of a form. This invocation returns a void value (the first argument); the VoiceXML <form> invoked is called "get_number_array" (the second argument); the names of the formal parameters for the call are in f_args (the third argument); and the run-time values of these parameters are in args (the fourth argument). The following piece of VoiceXML is produced, with comments as before in italics:
<form>
Initialise the variable 'return_to_server' with value 'void'
<var expr="'void'" name="return_to_server"/>
Define the variable 'references'. This will hold all the server-side objects passed to the client
<var name="references"/>
Define variables to hold the specific parameter values
<var name="numbers"/> <var name="prompts"/> <var name="size"/> <block> <script>
<![CDATA[
Initialise the 'size' parameter
size = 3
Create an array to be the value of the 'references' variable
references = new Array()
Store the value of the 'size' parameter in element 0 of 'references'
references[0] = size
Create an array to be the value of the 'prompts' variable
prompts = new Array()
Give this array an identity_tag whose value is 2
prompts.identity_tag = 2
Give this array a structure_type defining it as a string array with three elements
prompts.structure_type = '3:string'
Initialise the variables
prompts[0] = 'Say a number'
prompts[1] = 'Say another number'
prompts[2] = 'Say the last number'
references[1] = prompts
Similarly for the 'numbers' variable
numbers = new Array()
numbers.structure_type = '3:int'
numbers.identity_tag = 1
numbers[0] = 'undefined'
numbers[1] = 'undefined'
numbers[2] = 'undefined'
references[2] = numbers
]]> </script> </block>
Call the <form> element 'get_number_array' with the prepared arguments
<subdialog src="#get_number_array" name="subdialog_1"> <param expr="numbers" name="numbers"/> <param expr="prompts" name="prompts"/> <param expr="'3'" name="size"/>
</subdialog>
When the call has completed, use a <submit> to get back to the server side. Include the following information:
A) The return value 'return_to_server', which here is 'void'
B) The array 'references', which by construction contains all the client-side objects
<block> <submit expr="'http://localhost:8080/gatewayclient'" method="post" namelist="return_to_server references"/> </block> </form>
In this piece of generated VoiceXML, the variable 'references' is used to hold client-side translations of all the server-side objects that are passed to the call. The code in convertArgsToVXML also constructs an association list which associates the Java objects passed to the call (here arg_0 and arg_1) with the corresponding indices in 'references'. Thus the list associates arg_0 (the int array) with references[2], and arg_1 (the String array) with references[1].
The following call to sendVXML combines the dynamically generated VoiceXML code immediately above with the pre-compiled VoiceXML fragment get_number_array_3.vxml described at the end of Section 3.3. It then sends the result through the Gateway to the client, and waits for the next Gateway request, which should be the result of the <submit> at the end of the dynamically generated VoiceXML fragment. When the Gateway receives this new request, it passes the body of the <submit> back to the sendVXML method. Suppose for example that the user responded "seven", "five" and
"ten" when prompted for the three numbers; these numbers were then entered into the array 'numbers', which is still accessible as element 2 of the array 'references'. With this in mind, the material returned from the Gateway can be readily understood:
return_to_server=void
references[0]=3
references[2][0]=7
references[1][0]=Say%20a%20number
return_to_server=void
references[1].identity_tag=2
references[1].structure_type=3:string
references[2].identity_tag=1
references[2][1]=5
references[2].structure_type=3:int
references[1][1]=Say%20another%20number
references[2][2]=10
references[1][2]=Say%20the%20last%20number
The final step is to use the association list to unpack this information back to the original server-side data-structures. Since the list associates arg_0 (the int array) with element 2 of 'references', the numbers 7, 5 and 10 are correctly entered into elements 0, 1 and 2 respectively of the int array arg_0, thereby completing the call.
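A rough sketch of this unpacking step is given below; the SubmitDecoder name, the map-based association list and the key format are illustrative assumptions rather than the patent's actual code, but they show how the submitted name/value pairs can be written back into the corresponding server-side arrays.

import java.util.*;

// Hypothetical sketch of unpacking submitted client-side values back into server-side objects.
class SubmitDecoder {
    // associations maps a server-side object (e.g. the int array arg_0) to its index in 'references'.
    static void unpack(Map<Object, Integer> associations, Map<String, String> submitted) {
        for (Map.Entry<Object, Integer> entry : associations.entrySet()) {
            Object serverObject = entry.getKey();
            int index = entry.getValue();
            if (serverObject instanceof int[]) {
                int[] target = (int[]) serverObject;
                for (int i = 0; i < target.length; i++) {
                    // e.g. the key "references[2][0]" carries the client-side value of element 0.
                    String value = submitted.get("references[" + index + "][" + i + "]");
                    if (value != null) target[i] = Integer.parseInt(value);
                }
            }
            // Other structure_types (string arrays, feature-value lists, ...) would be handled similarly.
        }
    }
}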
Note that the second proxy call, to the method sayTTS, is handled in a similar fashion, although it is in fact somewhat more straightforward, and so will not be specifically discussed here.
4. Further Embodiments
The above embodiment has focused primarily on an implementation in which a subset of Java is compiled into a piece of dynamic VoiceXML, with Java running on the server and VoiceXML on the client. This section considers various further embodiments. Thus Section 4.1 discusses using a larger subset of Java; Section 4.2 discusses using procedural languages other than Java; and Section 4.3 discusses using voice scripting languages other than VoiceXML.
4.1 Using a larger subset of Java
The approach described above can be extended to handle a larger subset of Java if so desired, on the following basis. Coverage of normal (non-static) methods and of Java exceptions is discussed in Sections 4.1.1 and 4.1.2 below, respectively. Section 4.1.3 considers the question of whether there are any Java constructs for which there may not be a suitable translation into VoiceXML.
4.1.1 Non-static methods
The basic strategy for translating non-static methods into VoiceXML is first to reduce them to functions, with one function for each method name. Since the architecture tags each generated VoiceXML object with the type of the Java object to which it is intended to correspond, it is then possible to use run-time dispatching to delegate a method invocation to the translation of the individual method appropriate to the object on which the method was invoked, together with its arguments.
To illustrate with a concrete example, suppose that we have a method M, with signature

String M(int n)

defined for objects of the two classes Class1 and Class2. This would be turned into three functions MGeneral, MClass1 and MClass2, related roughly as follows:
String MGeneral(Object o, int n) {
    if ( class_of_object(o) == "Class1" ) {
        return MClass1(o, n);
    } else {
        if ( class_of_object(o) == "Class2" ) {
            return MClass2(o, n);
        } else {
            (some kind of error handling)
        }
    }
}
Here, MClass1 is created from the definition of the method M in Class1 by adding the Class1 object o on which the method is invoked as an extra argument. Direct references to data members of Class1 in the definition of the method M are then translated into corresponding references to the data members of o. For example, suppose Class1 has a String data member called 'message', and M is defined as follows:
String M(int n) {
    if ( n < 0 ) {
        return "negative";
    } else {
        return message;
    }
}
MClass1 is then defined as:

String MClass1(Class1 o, int n) {
    if ( n < 0 ) {
        return "negative";
    } else {
        return o.message;
    }
}
Functions like MClass1 and MGeneral can be translated into VoiceXML using the methods already described. Translating non-static Java methods into functions of this kind can be done using standard techniques from the Java compiler literature.
4.1.2 Java exceptions
There are certain difficulties involved in translating Java exception handling into VoiceXML.
This results from VoiceXML's very non-standard treatment of exceptions. In contrast to nearly all other programming languages, exceptions are not passed back up the call stack to be made available to exception handlers, but can only be caught at the fixed syntactic levels of form, document or application.
A partial treatment of exceptions can be implemented in a straightforward manner by translating them into a special type of return value. This would be similar to the treatment of local variables in conditionals and iterative loops described in Sections 2.3.8 and 2.3.10. This type of solution is expected to work adequately for user-generated exceptions, i.e. exceptions intentionally created using the 'throw' construction; 'throw' will just translate into a special kind of 'return' statement.
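Purely as an illustration of this idea (the Result wrapper and its field names are invented for the sketch and do not appear in the patent), a method that uses 'throw' can be rewritten so that the exception becomes a tagged return value which the caller inspects:

// Hypothetical illustration: a user-generated exception becomes a special return value.
class Result {
    String returnType;   // "value" for a normal result, "exception" for a thrown one
    int value;           // the normal return value, if any
    String exception;    // the exception tag, if any
}

class Divider {
    // Original style: signals bad input with a user-written 'throw'.
    static int divideOrThrow(int a, int b) {
        if (b == 0) throw new ArithmeticException("divide by zero");
        return a / b;
    }

    // Transformed style: the 'throw' becomes a special kind of 'return'.
    static Result divide(int a, int b) {
        Result r = new Result();
        if (b == 0) {
            r.returnType = "exception";
            r.exception = "divide_by_zero";
        } else {
            r.returnType = "value";
            r.value = a / b;
        }
        return r;
    }
}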
A generalised scheme for handling exceptions not generated by user code, such as an exception resulting from a division by zero, is somewhat more problematic, although partial solutions that are likely to be sufficient in most practical situations are again feasible. One approach, for example, would be to trap non-user-generated exceptions at 'form' level, and then pass them upwards as return values.
4.1.3 Java constructs outside the ambit of VoiceXML
There are certain Java constructs that, by their very nature, do not have a suitable translation into VoiceXML. In particular, all constructs relating to multi-threading ('synchronized', etc) are untranslatable, since VoiceXML is single-threaded. It will be appreciated of course that it is entirely within a programmer's discretion whether or not to use multithreading in Java code. Restricting a developer from using these Java multithreading facilities does not represent a significant hindrance, given that such multithreading could not of course have been used anyway if the voice application had been originally coded directly into VoiceXML instead.
4.2 Using other procedural languages
The skilled person will appreciate that the approach described herein can be readily applied to suitably chosen subsets of other procedural languages like C, C++ and Perl, since there is a high degree of similarity between these languages at the conceptual level (the use of variables, loops, conditionals, and so on). Of course, within any given language there may be particular constructs for which it is difficult to find a suitable translation into VoiceXML (an example might be the use of pointers in C and C++), but this is unlikely to affect the overall functionality that can be achieved.
4.3 Using other voice scripting languages
The embodiment described herein has focused on the use of VoiceXML, since as mentioned above this is the de facto standard in this area. Further developments and modifications of VoiceXML are likely to appear in the future, but it is not anticipated that these will impact the utility of the present invention.
The general approach described herein can also be employed if another client-side voice scripting language were to be used instead of VoiceXML. One possible example here is the Speech Application Language Tags (SALT), which provides a set of extensions to HTML/XML for speech applications (see http://www.saltforum.org ). Of course, the particular implementation details will vary according to the specific client platform, but these will be within the ambit of the skilled person. Such changes to the client platform should not detract from the underlying strategy of designing an application as a single procedural program, and then being able to compile this program into an application distributed between (i) a procedural portion running on a server, and (ii) a voice dialogue portion running on the client.
***
Although the above description has been presented primarily in the context of the embodiment of Figure 1, it will be appreciated that there are many other embodiments in which a voice browser could be employed. For example, the user could connect to the client via a computer network rather than a conventional telephone network (using a facility such as Voice over the Internet). It is also known or contemplated to support voice browsers on other forms of client system, some potentially quite different from a standard interactive voice response system. One example of this is where the client is a Personal Digital Assistant (PDA). In this case no telephony is involved, but the user speaks directly into the PDA, and receives audio output back from the PDA. Note that in this configuration the PDA still functions as a VoiceXML client, and is therefore directly compatible with the approach described above. Indeed, it will be appreciated that this wide range in the nature of the potential client device underlines the usefulness of annotations, in that the optimum distribution of processing between the server and client will clearly vary according to the properties of the client for any given application environment.
In conclusion therefore, although certain particular embodiments have been described in detail herein, it will be appreciated that this is by way of exemplification only. The skilled person will be aware of many further potential modifications and adaptations that fall within the scope of the claimed invention and its equivalents.

Claims

1. A method of developing a voice application for a client-server environment, wherein a server supports a high-level procedural language in which data objects have a first format and a client supports a voice mark-up language in which data objects have a second format, said method comprising the steps of: writing said voice application in a high-level procedural language; providing one or more annotations for said voice application, said annotations being indicative of which part of the voice application is to be executed on the server, and which part of the voice application is to be executed on the client; transforming the part of the voice application to be performed on the client from said high-level procedural language into the voice mark-up language supported by the client in accordance with said one or more annotations; and modifying the part of the voice application that is to be executed on the server in order to associate data objects in the first format on the server with data objects in said second format on the client.
2. The method of claim 1, further comprising the step of providing a set of speech functions that can be invoked from said voice application in a high-level procedural language, wherein said speech functions are to be performed on the client.
3. The method of claim 2, wherein said step of transforming includes transforming any invocations of said speech functions into the voice mark-up language supported by the client, irrespective of whether or not said speech functions are included in the part of the voice application that is indicated by the annotations as to be executed on the client.
4. The method of any preceding claim, further comprising the step of dividing the voice application in a high-level procedural language into three portions: a first portion that is to be executed on the client, and which is transformed from said high-level procedural language into the voice mark-up language supported by the client; a second portion that is to be executed on the server to interact directly with said first portion, and which is modified to associate data objects in the first format on the server with data objects in said second format on the client; and a third portion that is to be executed on the server, but that does not interact directly with said first portion.
5. The method of claim 4, wherein said annotations identify functions that belong to said second portion of the voice application.
6. The method of claim 5, wherein functions that belong to the first portion are identified by determining the transitive closure of functions called by functions that belong to said second portion.
7. The method of any of claims 4 to 6, wherein a function in said second portion is replaced by a corresponding proxy that stores information specifying the associations between data objects in the first format on the server and data objects in said second format on the client.
8. The method of claim 7, wherein a proxy performs the run-time steps of translating the arguments of its corresponding function into code in said voice mark-up language, and transferring the code to the client for execution.
9. The method of any preceding claim, wherein said steps of transforming and modifying comprise a compilation process.
10. A method of developing a voice application for a client-server environment, wherein a server supports a high-level procedural language in which data objects have a first format and a client supports a voice mark-up language in which data objects have a second format, said method comprising the steps of: writing said voice application in a high-level procedural language, wherein a part of the voice application is to be executed on the server, and a part of the voice application is to be executed on the client; and compiling said voice application written in the high-level procedural language, wherein the step of compiling includes the steps of: transforming the part of said voice application that is to be executed on the client platform into the voice mark-up language; and modifying the part of the voice application that is to be executed on the server in order to associate data objects in the first format on the server with data objects in said second format on the client.
11. The method of claim 10, wherein said voice mark-up language comprises VoiceXML.
12. The method of claim 11, wherein conditional constructions in said high-level procedural language are compiled into VoiceXML conditional subdialog calls.
13. The method of claim 11 or 12, wherein loop constructions in said high-level procedural language are compiled into recursive VoiceXML subroutine calls.
14. The method of any of claims 10 to 13, wherein functions that are to be executed on the client platform in the voice mark-up language and that do not directly call basic speech functions on the client are compiled into ECMAScript-compatible code.
15. The method of any of claims 10 to 14, wherein an annotations file is used to determine which part of the voice application is to be executed on the client, and which part of the voice application is to be executed on the server.
16. A method of running a voice application for a client-server environment, wherein a server supports a high-level procedural language in which data objects have a first format and a client supports a voice mark-up language in which data objects have a second format, said method comprising the steps of: commencing said voice application in a high-level procedural language on the server, said application involving at least one data object in said first format; running said voice application on the server until processing is to be transferred to the client; dynamically generating code on the server in said voice mark-up language from the voice application in a high-level procedural language, wherein said dynamically generated code supports the transformation of the at least one data object from said first format into said second format; and rendering said dynamically generated code in the voice mark-up language from the server to the client for execution on the client.
17. The method of claim 16, wherein said voice application is commenced in response to a request from the client.
18. The method of claim 16 or 17, further comprising the initial step of compiling said voice application in a high-level procedural language on the server into three portions: a first portion that is to be executed on the client, and which is transformed from said high-level procedural language into the voice mark-up language supported by the client; a second portion that is to be executed on the server to interact directly with said first portion, and which is modified to associate data objects in the first format on the server with data objects in said second format on the client; and a third portion that is to be executed on the server, but that does not interact directly with said first portion.
19. The method of claim 18, wherein said second portion is responsible for dynamically generating code on the server in said voice mark-up language from the voice application in a high- level procedural language, wherein said dynamically generated code supports the transformation of the at least one data object from said first format into said second format.
20. The method of claim 18 or 19, wherein the dynamically generated code is combined with said first portion for rendering from the server to the client for execution on the client.
21. The method of any of claims 18 to 20, wherein said second portion maintains a table indicating the association between data objects on the server in said first format and data objects on the client in said second format.
22. The method of any of claims 16 to 21, further comprising the step of receiving a data object back from the client in said second format, and transforming said data object into said first format.
23. The method of claim 22, wherein an earlier version of said data object in said first format exists on the server, and said step of transforming said data object into said first format comprises associating the received data object with the earlier version of said data object, and updating the existing version of said data object in accordance with the received version of the data object.
24. Apparatus for developing a voice application for a client-server environment, wherein a server supports a high-level procedural language in which data objects have a first format and a client supports a voice mark-up language in which data objects have a second format, the voice application being written in a high-level procedural language and accompanied by one or more annotations indicative of which part of the voice application is to be executed on the server, and which part of the voice application is to be executed on the client, said apparatus comprising: means for transforming the part of the voice application to be performed on the client from said high-level procedural language into the voice mark-up language supported by the client in accordance with said one or more annotations; and means for modifying the part of the voice application that is to be executed on the server in order to associate data objects in the first format on the server with data objects in said second format on the client.
25. Apparatus for developing a voice application for a client-server environment, wherein a server supports a high-level procedural language in which data objects have a first format and a client supports a voice mark-up language in which data objects have a second format, the voice application being written in a high-level procedural language, said apparatus comprising a compiler for performing a compilation process on said voice application written in the high-level procedural language, said compiler including: means for transforming the part of said voice application that is to be executed on the client platform into the voice mark-up language as part of said compilation process; and means for modifying the part of the voice application that is to be executed on the server in order to associate data objects in the first format on the server with data objects in said second format on the client as part of said compilation process.
26. A server for running a voice application in a client-server environment, wherein the server supports a high-level procedural language in which data objects have a first format and a client supports a voice mark-up language in which data objects have a second format, said server comprising: an application server system for commencing said voice application in a high-level procedural language on the server, said application involving at least one data object in said first format, wherein the voice application runs on the server until processing is to be transferred to the client; a dynamic compiler for generating code on the server in said voice mark-up language from the voice application in a high-level procedural language, wherein said dynamically generated code supports the transformation of the at least one data object from said first format into said second format; and a communications facility for rendering the dynamically generated code in the voice mark-up language from the server to the client for execution on the client.
27. A computer program for use in developing a voice application for a client-server environment, wherein a server supports a high-level procedural language in which data objects have a first format and a client supports a voice mark-up language in which data objects have a second format, the voice application being written in a high-level procedural language and accompanied by one or more annotations indicative of which part of the voice application is to be executed on the server, and which part of the voice application is to be executed on the client, said program comprising instructions to perform the steps of: transforming the part of the voice application to be performed on the client from said high-level procedural language into the voice mark-up language supported by the client in accordance with said one or more annotations; and modifying the part of the voice application that is to be executed on the server in order to associate data objects in the first format on the server with data objects in said second format on the client.
28. A compiler for developing a voice application for a client-server environment, wherein a server supports a high-level procedural language in which data objects have a first format and a client supports a voice mark-up language in which data objects have a second format, the voice application being written in a high-level procedural language, said compiler including program instructions for performing a compilation process on said voice application written in the high-level procedural language, wherein said compilation process includes: transforming the part of said voice application that is to be executed on the client platform into the voice mark-up language; and modifying the part of the voice application that is to be executed on the server in order to associate data objects in the first format on the server with data objects in said second format on the client.
29. A computer program providing a platform for running a voice application in a client-server environment, wherein a server supports a high-level procedural language in which data objects have a first format and a client supports a voice mark-up language in which data objects have a second format, said program including instructions for: commencing said voice application in a high-level procedural language on the server, said application involving at least one data object in said first format; running said voice application on the server until processing is to be transferred to the client; dynamically generating code on the server in said voice mark-up language from the voice application in a high-level procedural language, wherein said dynamically generated code supports the transformation of the at least one data object from said first format into said second format; and rendering said dynamically generated code in the voice mark-up language from the server to the client for execution on the client.
30. A computer program comprising instructions which when performed on a machine or machines implement the method of any of claims 1 to 23.
31. Apparatus for implementing the method of any of claims 1 to 23.