W3C Voice Browser Activity
-
Standards for Voice and Dialogue applications
-
VoiceXML
-
SRGS
-
SISR
-
SSML
-
PLS
-
Call Control XML
-
State Chart XML
-
…
-
-
W3C Recommendations
VoiceXML
-
Language for developing dialogue applications.
-
Primarily targeted at telephone applications.
-
telephone support automation
-
railway/bus schedule information
-
ticket reservation
-
…
-
-
Describes the algorithm for dialogue flow control (dialogue strategy).
-
Alternatively, the dialogue flow can be described by a finite state automaton with output (Mealy automaton).
-
SCXML
-
-
W3C standard (current version 2.1; version 3.0 is a Working Draft)
VoiceXML - processing
-
The application needs to run on a VoiceXML platform or in a VoiceXML interpreter.
-
desktop platforms - OptimTalk, publicVoiceXML, JVoiceXML, …
-
open-source online - Asterisk+VoiceGlue, Asterisk+OpenVXI, …
-
on-line commercial:
-
Bevocal Cafe
-
Voxeo Prophecy
-
…
-
-
VoiceXML forms in XHTML documents
-
using namespaces (formerly W3C submission XHTML+Voice profile 1.0)
-
Supported in the Opera and Firefox web browsers.
-
-
…
-
VoiceXML - example
Figure: VoiceXML example
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="pizza-mixed">
    <!-- Form-level grammar: the caller may give the whole order in one utterance -->
    <grammar src="pizza.grxml"/>
    <initial name="pizzaall">
      <prompt>Welcome to FI pizzeria</prompt>
      <!-- After two failed attempts, continue with the individual fields -->
      <nomatch count="2"><assign name="pizzaall" expr="true"/></nomatch>
      <noinput count="2"><assign name="pizzaall" expr="true"/></noinput>
    </initial>
    <field name="kind">
      <prompt>What kind of pizza do you want?</prompt>
      <nomatch>We have salami, mozzarella and appolo pizza</nomatch>
      <noinput>We have salami, mozzarella and appolo pizza</noinput>
      <grammar src="pizza.grxml#kind"/>
    </field>
    <field name="topping">
      <prompt>What topping do you want?</prompt>
      <nomatch>We offer ketchup and chilli.</nomatch>
      <noinput>We offer ketchup and chilli.</noinput>
      <grammar src="pizza.grxml#topping"/>
    </field>
    <field name="drink">
      <prompt>What do you want to drink?</prompt>
      <nomatch>Select one of coke, sprite and water</nomatch>
      <noinput>Select one of coke, sprite and water</noinput>
      <grammar src="pizza.grxml#drink"/>
    </field>
    <!-- Confirmation of the collected order -->
    <field name="ack">
      <prompt>Did you order <value expr="kind"/> pizza with <value
        expr="topping"/> and <value expr="drink"/>?</prompt>
      <grammar src="yesno.grxml"/>
    </field>
    <!-- Runs once all fields are filled; a rejected order clears them and the form starts over -->
    <filled>
      <if cond="ack=='yes'">
        <prompt>Order submitted</prompt>
      <else/>
        <clear namelist="kind topping drink ack"/>
      </if>
    </filled>
  </form>
</vxml>
SRGS (Speech Recognition Grammar Specification)
-
Standard for describing context-free grammars.
-
describes the accepted inputs of particular VoiceXML fields
-
-
Part of W3C Voice Browser Activity standards
-
Current version: 1.0
-
SRGS - motivation
-
The user’s voice input needs to be recognized - continuous speech recognition.
-
success rate 50-99 %
-
-
Ways to improve the success rate:
-
improve the language model
-
problem domain restriction
-
improve the user model
-
-
Problem domain restriction + language model improvement = SRGS.
SRGS - example
Figure: SRGS grammar referenced in the previous VoiceXML example (pizza.grxml)
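<?xml version="1.0" encoding="UTF-8"?>
<grammar version="1.0" xmlns="http://www.w3.org/2001/06/grammar"
         xml:lang="en-US" root="mixed" tag-format="semantics/1.0">
  <rule id="mixed">
    <item>
      <ruleref special="GARBAGE"/>
      <ruleref uri="#kind"/> pizza <ruleref special="GARBAGE"/>
      <ruleref uri="#topping"/> and <ruleref uri="#drink"/>
    </item>
    <!-- SISR tag: copy the results of the referenced rules into the output -->
    <tag>
      {
        out.kind=rules.kind;
        out.topping=rules.topping;
        out.drink=rules.drink;
      }
    </tag>
  </rule>
  <!-- scope="public" allows the rule to be referenced as pizza.grxml#kind -->
  <rule id="kind" scope="public">
    <one-of>
      <item>salami</item>
      <item>mozzarella</item>
      <item>polo</item>
    </one-of>
  </rule>
  ...
</grammar>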
SISR (Semantic Interpretation for Speech Recognition)
-
Purpose:
-
What is the meaning of recognized input?
-
-
Language for deriving the semantics of the recognized input.
-
Based on ECMAScript.
-
Used in speech recognition grammars (see previous slide).
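As a minimal sketch of how SISR tags attach meaning to recognized words, the grammar below maps several spoken variants to a single value. It shows one possible content of the yesno.grxml file referenced by the VoiceXML example; the actual file is not shown in the slides, so the rule name and the list of accepted words are assumptions.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical yesno.grxml: the rule id and word list are illustrative only -->
<grammar version="1.0" xmlns="http://www.w3.org/2001/06/grammar"
         xml:lang="en-US" root="yesno" tag-format="semantics/1.0">
  <rule id="yesno" scope="public">
    <one-of>
      <!-- Different utterances, same semantic result -->
      <item>yes <tag>out = "yes";</tag></item>
      <item>yeah <tag>out = "yes";</tag></item>
      <item>no <tag>out = "no";</tag></item>
      <item>nope <tag>out = "no";</tag></item>
    </one-of>
  </rule>
</grammar>
```

With such a grammar, a field like ack would receive "yes" or "no" regardless of which variant the caller said, which is what a condition such as ack=='yes' relies on.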
SSML (Speech Synthesis Markup Language)
-
W3C Standard
-
current version: 1.1 (September 2010)
-
Used to describe the prosodic characteristics of synthesized speech.
-
loudness
-
prosody
-
emphasis
-
speech rate
-
voice gender (male, female, neutral)
-
…
-
-
Contains markup for describing the pronunciation of foreign words.
-
IPA (International Phonetic Alphabet) can be used.
-
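The characteristics listed above map to SSML elements roughly as in the following sketch. The sentence and the attribute values are illustrative assumptions, and how each setting actually sounds depends on the synthesizer.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Illustrative sketch; the text and attribute values are assumptions -->
<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <!-- voice gender -->
  <voice gender="female">
    <!-- speech rate and loudness -->
    <prosody rate="slow" volume="soft">Welcome to FI pizzeria.</prosody>
    <!-- emphasis -->
    We have <emphasis level="strong">three</emphasis> kinds of
    <!-- explicit pronunciation written in IPA -->
    <phoneme alphabet="ipa" ph="ˈpiːtsə">pizza</phoneme>.
  </voice>
</speak>
```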
SSML - example of loudness and breaks
Figure: SSML Breaks and loudness control example
<?xml version="1.0" encoding="utf-8"?>
<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                           http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
       xml:lang="cs-CZ">
  <!-- "Good morning.", spoken loudly and followed by a pause -->
  <prosody volume="loud">
    Dobre rano.<break/>
  </prosody>
  <!-- "How are you?", spoken at the default volume -->
  <prosody volume="default">
    Jak se mate?
  </prosody>
</speak>
SSML - example of intonation modeling
Figure: SSML Intonation modeling
<speak ...>
  <!-- "Are you doing well?" with a rising pitch contour towards the end -->
  <prosody contour="(0%,50Hz) (75%,+10%) (80%,+20%) (90%,+30%)">
    Mas se dobre?
  </prosody>
</speak>
PLS (Pronunciation Lexicon Specification)
-
Pronunciation Lexicon Specification
-
W3C standard
-
Current version: 1.0 (October 2008)
-
-
Developed for describing the pronunciation of words, abbreviations, etc.
-
Used for:
-
Speech synthesis (SSML) - pronunciation of
-
foreign words
-
abbreviations
-
numeric values
-
…
-
-
Speech recognition (SRGS) - PLS allows several pronunciations of a word to be described, so that each variant is recognized correctly.
-
PLS Structure
-
Root element - lexicon
-
contains one or more lexicon entries - lexeme element
-
contains:
-
one or more word notations - grapheme element
-
one or more word pronunciations - phoneme element
-
pronunciation may be written using IPA, SAMPA, etc.
-
-
-
-
PLS - example
Figure: PLS pronunciation example
<?xml version="1.0" encoding="utf-8"?>
<lexicon version="1.0"
         xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://www.w3.org/2005/01/pronunciation-lexicon
                             http://www.w3.org/TR/2007/CR-pronunciation-lexicon-20071212/pls.xsd"
         alphabet="ipa" xml:lang="cs-CZ">
  <lexeme>
    <grapheme>CSR</grapheme>
    <!-- Abbreviation spelled out letter by letter -->
    <phoneme>tʃˈeː ˈes ˈer</phoneme>
    <!-- Abbreviation expanded to the full name -->
    <phoneme>tʃˈeskaː rˈepublˌika</phoneme>
  </lexeme>
</lexicon>
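To illustrate the speech synthesis use case, the sketch below shows how an SSML 1.1 document could load such a lexicon and apply it to a piece of text. The file name csr.pls and the identifier abbr are hypothetical; the lexicon and lookup elements themselves come from SSML 1.1.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch only: "csr.pls" and the id "abbr" are assumed names -->
<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="cs-CZ">
  <!-- Load the pronunciation lexicon and give it an identifier -->
  <lexicon uri="csr.pls" xml:id="abbr"/>
  <!-- Tokens inside lookup are pronounced according to the referenced lexicon -->
  <lookup ref="abbr">
    CSR
  </lookup>
</speak>
```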
Call Control XML
-
Provides declarative markup to describe telephony call control
-
directing calls to corresponding application/human
-
merging multiple calls into a conference call
-
the ability to place outgoing calls
-
handling a richer class of asynchronous events
-
handling the outside call queue for VoiceXML
-
etc.
-
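A minimal sketch of the first item above, directing an incoming call to a VoiceXML application, might look as follows in CCXML. The document name pizza.vxml is an assumption, and a real application would typically also handle error events.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch only: "pizza.vxml" is a hypothetical dialogue document -->
<ccxml version="1.0" xmlns="http://www.w3.org/2002/09/ccxml">
  <eventprocessor>
    <!-- Incoming call: answer it -->
    <transition event="connection.alerting">
      <accept/>
    </transition>
    <!-- Call connected: hand the caller over to the VoiceXML dialogue -->
    <transition event="connection.connected">
      <dialogstart src="'pizza.vxml'"/>
    </transition>
    <!-- Dialogue finished: end the CCXML session -->
    <transition event="dialog.exit">
      <exit/>
    </transition>
    <!-- Caller hung up: end the CCXML session -->
    <transition event="connection.disconnected">
      <exit/>
    </transition>
  </eventprocessor>
</ccxml>
```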
State Chart XML
-
W3C Recommendation (September 2015).
-
General-purpose event-based state machine language.
-
Based on:
-
Harel State Tables (included in UML for example)
State Chart XML - Relation to Dialogue
-
Dialogue can be modeled using Mealy Automaton.
-
Mealy automaton - finite state automaton with an output function.
-
The states of the automaton correspond to the states of the dialogue.
-
The transition is a function of the current state and the user input.
-
The output function gives the dialogue system's response.
-
-
A Mealy automaton can be described using SCXML (see example).
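As a minimal sketch of this mapping (not taken from the original demo), the following SCXML document models a short ordering dialogue. States are dialogue states, events stand for recognized user input, and the system response is produced on the transition, as in a Mealy automaton. The event names (user.kind, user.nomatch, ...) are assumptions, and log stands in for a real speech output.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch only: state ids, event names and responses are illustrative -->
<scxml version="1.0" xmlns="http://www.w3.org/2005/07/scxml"
       datamodel="ecmascript" initial="AskKind">
  <state id="AskKind">
    <!-- input: pizza kind recognized; output: next question -->
    <transition event="user.kind" target="AskDrink">
      <log label="system" expr="'What do you want to drink?'"/>
    </transition>
    <!-- input not recognized: stay in the same state and give help -->
    <transition event="user.nomatch">
      <log label="system" expr="'We have salami and mozzarella pizza.'"/>
    </transition>
  </state>
  <state id="AskDrink">
    <transition event="user.drink" target="Confirm">
      <log label="system" expr="'Shall I submit the order?'"/>
    </transition>
  </state>
  <state id="Confirm">
    <transition event="user.yes" target="Done">
      <log label="system" expr="'Order submitted.'"/>
    </transition>
    <!-- rejected order: start the dialogue again -->
    <transition event="user.no" target="AskKind">
      <log label="system" expr="'What kind of pizza do you want?'"/>
    </transition>
  </state>
  <final id="Done"/>
</scxml>
```

Placing the response inside the transition, rather than in an onentry handler of the target state, is what makes the output depend on both the current state and the input.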
SCXML - Demo
Example 1: Process planning demo
Figure: Process state diagram
SCXML - Demo
Example 1: Corresponding SCXML
<?xml version="1.0" encoding="UTF-8"?>
<!-- The initial state is named by the "initial" attribute of scxml -->
<scxml version="1.0" xmlns="http://www.w3.org/2005/07/scxml" initial="Created">
  <state id="Created">
    <transition target="Waiting" event="enqueue"/>
  </state>
  <state id="Waiting">
    <transition target="Running" event="assign"/>
  </state>
  <state id="Running">
    <!-- The event attribute is a space-separated list of event descriptors,
         so a multi-word event name is written as a single token -->
    <transition target="Blocked" event="waitForResource"/>
    <transition target="Waiting" event="timeout"/>
    <transition target="Terminated" event="terminate"/>
  </state>
  <state id="Blocked">
    <transition target="Waiting" event="resourceAvailable"/>
  </state>
  <final id="Terminated"/>
</scxml>