
Architecture

 


Introduction

The Speech API is implemented as a series of Component Object Model (COM) interfaces. This section identifies the top-level objects, their child objects, and their methods.

The SAPI model is divided into two distinct levels:

High-level SAPI: This level provides basic speech services in the form of command-and-control speech recognition and simple text-to-speech output.
Low-level SAPI: This level provides detailed access to all speech services, including direct interfaces to control dialogs and manipulation of both speech recognition (SR) and text-to-speech (TTS) behavior attributes.

Each of the two levels of SAPI services has its own set of objects and methods.

Along with the two sets of COM interfaces, Microsoft has also published an OLE Automation type library for the high-level SAPI objects. This set of OLE objects is discussed at the end of this section.

When you complete this section, you'll understand the basic architecture of the SAPI model, including all the SAPI objects and their uses. Detailed information about the objects' methods and parameters is covered in the next section, "SAPI Basics."

 

Note

Most of the Microsoft Speech API is accessible only through C++ code. For this reason, many of the examples shown in this section are expressed in Microsoft Visual C++ code. You do not need to be able to code in C++ to understand the information discussed here. At the end of this section, the OLE Automation objects available through Visual Basic are also discussed.

 

High-Level SAPI

The high-level SAPI services provide access to basic forms of speech recognition and text-to-speech services. This is ideal for providing voice-activated menus, command buttons, and so on. It is also sufficient for basic rendering of text into speech.

The high-level SAPI interface has two top-level objects: one for voice command services (speech recognition) and one for voice text services (text-to-speech). The following two sections describe each of these top-level objects, their child objects, and the interfaces available through each object.

Voice Command

The Voice Command object is used to provide speech recognition services. It is useful for providing simple command-and-control speech services such as implementing menu options, activating command buttons, and issuing other simple operating system commands.

The Voice Command object has one child object and one collection object. The child object is the Voice Menu object and the collection object is a collection of enumerated menu objects (see Figure 15.1).

 

Voice Command Object

The Voice Command object supports three interfaces:

The Voice Command interface
The Attributes interface
The Dialogs interface

The Voice Command interface is used to enumerate, create, and delete voice menu objects. This interface is also used to register an application to use the SR engine. An application must successfully complete the registration before the SR engine can be used. An additional method defined for the Voice Command interface is the Mimic method. This is used to play back a voice command to the engine; it can be used to "speak" voice commands directly to the SR engine. This is similar to playing keystroke or mouse-action macros back to the operating system.

The Attributes interface is used to set and retrieve a number of basic parameters that control the behavior of the voice command system. You can enable or disable voice commands, adjust input gain, establish the SR mode, and control the input device (microphone or telephone).

The Dialogs interface gives you access to a series of dialog boxes that can be used as a standard set of input screens for setting and displaying SR engine information. The SAPI model identifies five different dialog boxes that should be available through the Dialogs interface. The exact layout and content of these dialog boxes is not dictated by Microsoft, but is determined by the developer of the speech recognition engine. However, Microsoft has established general guidelines for the contents of the SR engine dialog boxes. Table 15.1 lists each of the five defined dialog boxes along with short descriptions of their suggested contents.

Table 15.1. The Voice Command dialog boxes.

Dialog Box Name Description
About Box Used to display the dialog box that identifies the SR engine and show its copyright information.
Command Verification Can be used as a verification pop-up window during a speech recognition session. When the engine identifies a word or phrase, this box can appear requesting the user to confirm that the engine has correctly understood the spoken command.
General Dialog Can be used to provide general access to the SR engine settings such as identifying the speaker, controlling recognition parameters, and the amount of disk space allotted to the SR engine.
Lexicon Dialog Can be used to offer the speaker the opportunity to alter the pronunciation lexicon, including altering the phonetic spelling of troublesome words, or adding or deleting personal vocabulary files.
Training Dialog Can be used to display a dialog box that takes the user through a training session, helping the SR engine adapt to the speaker's voice.

 

The Voice Menu Object and the Menu Object Collection

The Voice Menu object is the only child object of the Voice Command object. It is used to allow applications to define, add, and delete voice commands in a menu. You can also use the Voice Menu object to activate and deactivate menus and, optionally, to provide a training dialog box for the menu.

The voice menu collection object contains a set of all menu objects defined in the voice command database. Microsoft SAPI defines functions to select and copy menu collections for use by the voice command speech engine.

The Voice Command Notification Callback

In the process of registering the application to use a voice command object, a notification callback (or sink) is established. This callback receives messages regarding the SR engine activity. Typical messages sent out by the SR engine can include notifications that the engine has detected commands being spoken, that some attribute of the engine has been changed, or that spoken commands have been heard but not recognized.

 

Note

Notification callbacks require a pointer to the function that will receive all related messages. Callbacks cannot be registered using Visual Basic; you need C or C++. However, the voice command OLE Automation type library that ships with the Speech SDK has a notification callback built into it.

 

Voice Text

The SAPI model defines a basic text-to-speech service called voice text. This service has only one object-the Voice Text object. The Voice Text object supports three interfaces:

The Voice Text interface
The Attributes interface
The Dialogs interface

The Voice Text interface is the primary interface of the TTS portion of the high-level SAPI model. The Voice Text interface provides a set of methods to start, pause, resume, fast-forward, rewind, and stop the TTS engine while it is speaking text. This mirrors the VCR-style controls commonly employed for PC video and audio playback.

The Voice Text interface is also used to register the application that will request TTS services. An application must successfully complete the registration before the TTS engine can be used. This registration function can optionally pass a pointer to a callback function to be used to capture voice text messages. This establishes a notification callback with several methods, which are triggered by messages sent from the underlying TTS engine.

 

Note

Notification callbacks require a pointer to the function that will receive all related messages. Callbacks cannot be registered using Visual Basic; you need C or C++. However, the voice text OLE Automation type library that ships with the Speech SDK has a notification callback built into it.

 

The Attributes interface provides access to settings that control the basic behavior of the TTS engine. For example, you can use the Attributes interface to set the audio device to be used, set the playback speed (in words per minute), and turn the speech services on and off. If the TTS engine supports it, you can also use the Attributes interface to select the TTS speaking mode. The TTS speaking mode usually refers to a predefined set of voices, each having its own character or style (for example, male, female, child, adult, and so on).

The Dialogs interface can be used to let users set and retrieve information regarding the TTS engine. The exact contents and layout of the dialog boxes are not determined by Microsoft but by the TTS engine developer. Microsoft does, however, suggest the possible contents of each dialog box. Table 15.2 shows the four voice text dialogs defined by the SAPI model, along with short descriptions of their suggested contents.

Table 15.2. The Voice Text dialog boxes.

Dialog Name Description
About Box Used to display the dialog box that identifies the TTS engine and shows its copyright information.
Lexicon Dialog Can be used to offer the speaker the opportunity to alter the pronunciation lexicon, including altering the phonetic spelling of troublesome words, or adding or deleting personal vocabulary files.
General Dialog Can be used to display general information about the TTS engine. Examples might be controlling the speed at which the text will be read, the character of the voice that will be used for playback, and other user preferences as supported by the TTS engine.
Translate Dialog Can be used to offer the user the ability to alter the pronunciation of key words in the lexicon. For example, the TTS engine that ships with Microsoft Voice has a special entry that forces the speech engine to express all occurrences of "TTS" as "text to speech," instead of just reciting the letters "T-T-S."

 

Low-Level SAPI

The low-level SAPI services provide access to a much greater level of control of Windows speech recognition and text-to-speech services. This level is best for implementing advanced SR and TTS services, including the creation of dictation systems.

Just as there are two basic service types for high-level SAPI, there are two primary COM interfaces defined for low-level SAPI: one for speech recognition and one for text-to-speech services. The rest of this section outlines each of the objects and their interfaces.

 

Note

This part of the section covers the low-level SAPI services. These services are available only from C or C++ programs, not Visual Basic. However, even if you do not program in C, you can still learn a lot from this section. The material here gives you a good understanding of the details behind the SAPI OLE Automation objects, and may also give you some ideas on how to use the VB-level SAPI services in your programs.

 

Speech Recognition

The Speech Recognition object has several child objects and collections. There are two top-level objects in the SR system: the SR Engine Enumerator object and the SR Sharing object. These two objects are created using their unique CLSID (class ID) values. The purpose of both objects is to give an application information about the available speech recognition engines and allow the application to register with the appropriate engine. Once the engine is selected, one or more grammar objects can be created; as each phrase is heard, a temporary SR Results object is created that contains details about the phrase captured by the speech recognition engine. Figure 15.2 shows how the different objects relate to each other, and how they are created.

 

When an SR engine is created, a link to a valid audio input device is also created. While it is possible to create a custom audio input device, it is not required. The default audio input device is an attached microphone, but can also be set to point to a telephone device.

The rest of this section details the low-level SAPI SR objects and their interfaces.

The SR Enumerator and Engine Enumerator Objects

The role of the SR Enumerator and Engine Enumerator objects is to locate and select an appropriate SR engine for the requesting application. The Enumerator object lists all available speech recognition modes and their associated installed engines. This information is supplied by the child object of the Enumerator object: the Engine Enumerator object. The result of this search is a pointer to the SR engine interface that best meets the service request.

The Enumerator and Engine Enumerator objects support only two interfaces:

The ISREnum interface is used to get a list of all available engines.
The ISRFind interface is used to select the desired engine.

 

Note

The SR Enumerator and Engine Enumerator objects are used only to locate and select an engine object. Once that is done, these two objects can be discarded.

 

The SR Sharing Object

The SR Sharing object is a possible replacement for the SR Enumerator and Engine Enumerator objects. The SR Sharing object uses only one interface, the ISRSharing interface, to locate and select an engine object that will be shared with other applications on the PC. In essence, this allows a requesting application to register with an out-of-process SR server object. While often slower than creating an instance of a private SR object, using the Sharing object can reduce strain on memory resources.

The SR Sharing interface is an optional feature of speech engines and may not be available depending on the design of the engine itself.

The SR Engine Object

The SR Engine object is the heart of the speech recognition system. This object represents the actual speech engine, and it supports several interfaces for monitoring speech activity. The SR Engine object is created using the Select method of the ISREnum interface of the SR Enumerator object described earlier. Table 15.3 lists the interfaces supported by the SR Engine object along with a short description of their uses.

Table 15.3. The interfaces of the SR Engine object.

Interface Name Description
ISRCentral The main interface for the SR Engine object. Allows the loading and unloading of grammars, checks information status of the engine, starts and stops the engine, and registers and releases the engine notification callback.
ISRDialogs Used to display a series of dialog boxes that allow users to set parameters of the engine and engage in training to improve the SR performance.
ISRAttributes Used to set and get basic attributes of the engine, including input device name and type, volume controls, and other information.
ISRSpeaker Allows users to manage a list of speakers that use the engine. This is especially valuable when more than one person uses the same device. This is an optional interface.
ISRLexPronounce This interface is used to provide users access to modify the pronunciation or playback of certain words in the lexicon. This is an optional interface.

 

The SR Engine object also provides a notification callback interface (ISRNotifySink) to capture messages sent by the engine. These messages can be used to check on the performance status of the engine, and can provide feedback to the application (or speaker) that can be used to improve performance.

The Grammar Object

The Grammar object is a child object of the SR Engine object. It is used to load parsing grammars for use by the speech engine in analyzing audio input. The Grammar object contains all the rules, words, lists, and other parameters that control how the SR engine interprets human speech. Each phrase detected by the SR engine is processed using the loaded grammars.

The Grammar object supports three interfaces:

ISRGramCFG-This interface is used to handle grammar functions specific to context-free grammars, including the management of lists and rules.
ISRGramDictation-This interface is used to handle grammar functions specific to dictation grammars, including words, word groups, and sample text.
ISRGramCommon-This interface is used to handle tasks common to both dictation and context-free grammars. This includes loading and unloading grammars, activating or deactivating a loaded grammar, training the engine, and possibly storing SR results objects.

The Grammar object also supports a notification callback to handle messages regarding grammar events. Optionally, the grammar object can create an SR Results object. This object is discussed fully in the next section.

The SR Results Object

The SR Results object contains detailed information about the most recent speech recognition event. This could include a recorded representation of the speech, the interpreted phrase constructed by the engine, the name of the speaker, performance statistics, and so on.

 

Note

The SR Results object is optional and is not supported by all engines.

 

Table 15.4 shows the interfaces defined for the SR Results object, along with descriptions of their use. Only the first interface in the table is required (the ISRResBasic interface).

Table 15.4. The defined interfaces for the SR Results object.

Interface Name Description
ISRResBasic Used to provide basic information about the results object, including an audio representation of the phrase, the selected interpretation of the audio, the grammar used to analyze the input, and the start and stop time of the recognition event.
ISRResAudio Used to retrieve an audio representation of the recognized phrase. This audio file can be played back to the speaker or saved as a WAV format file for later review.
ISRResGraph Used to produce a graphic representation of the recognition event. This graph could show the phonemes used to construct the phrase, show the engine's "score" for accurately detecting the phrase, and so on.
ISRResCorrection Used to provide an opportunity to confirm that the interpretation was accurate, possibly allowing for a correction in the analysis.
ISRResEval Used to re-evaluate the results of the previous recognition. This could be used by the engine to request the speaker to repeat training phrases and use the new information to re-evaluate previous interpretations.
ISRResSpeaker Used to identify the speaker performing the dictation. Could be used to improve engine performance by comparing stored information from previous sessions with the same speaker.
ISRResModifyGUI Used to provide a pop-up window asking the user to confirm the engine's interpretation. Could also provide a list of alternate results to choose from.
ISRResMerge Used to merge data from two different recognition events into a single unit for evaluation purposes. This can be done to improve the system's knowledge about a speaker or phrase.
ISRResMemory Used to allocate and release memory used by results objects. This is strictly a housekeeping function.

 

Text-to-Speech

The low-level text-to-speech services are provided by one primary object-the TTS Engine object. Like the SR object set, the TTS object set has an Enumerator object and an Engine Enumerator object. These objects are used to locate and select a valid TTS Engine object and are then discarded.

 

The TTS services also use an audio output object. The default object for output is the PC speakers, but this can be set to the telephone device. Applications can also create their own output devices, including the creation of a WAV format recording device as the output for TTS engine activity.

The rest of this section discusses the details of the low-level SAPI TTS objects.

The TTS Enumerator and Engine Enumerator Objects

The TTS Enumerator and Engine Enumerator objects are used to obtain a list of the available TTS engines and their speaking modes. They both support two interfaces:

ITTSEnum-Used to obtain a list of the available TTS engines.
ITTSFind-Used to obtain a pointer to the requested TTS engine.

Once the objects have provided a valid address to a TTS engine object, the TTS Enumerator and Engine Enumerator objects can be discarded.

The TTS Engine Object

The TTS Engine object is the primary object of low-level SAPI TTS services. The Engine object supports several interfaces. Table 15.5 lists the interfaces used for the translations of text into audible speech.

Table 15.5. The TTS Engine object interfaces.

Interface Name Description
ITTSCentral The main interface for the TTS Engine object. It is used to register an application with the TTS system, to start, pause, and stop TTS playback, and so on.
ITTSDialogs Used to provide a connection to several dialog boxes. The exact contents of each dialog box is determined by the engine provider, not by Microsoft. Dialog boxes defined for the interface are:
About Box
General Dialog
Lexicon Dialog
Training Dialog
ITTSAttributes Used to set and retrieve control parameters of the TTS engine, including playback speed and volume, playback device, and so on.

 

In addition to the interfaces described in Table 15.5, the TTS Engine object supports two notification callbacks:

ITTSNotifySink-Used to send the application messages regarding the playback of text as audio output, including start and stop of playback and other events.
ITTSBufNotifySink-Used to send messages regarding the status of text in the playback buffer. If the content of the buffer changes, messages are sent to the application using the TTS engine.

Speech Objects and OLE Automation

Microsoft supplies an OLE Automation type library with the Speech SDK. This type library can be used with any VBA-compliant software, including Visual Basic, Access, Excel, and others. The OLE Automation set provides high-level SAPI services only. The objects, properties, and methods are quite similar to the objects and interfaces provided by the high-level SAPI services described at the beginning of this section.

There are two type library files in the Microsoft Speech SDK:

VCAUTO.TLB supplies the speech recognition services.
VTXTAUTO.TLB supplies the text-to-speech services.

You can load these libraries into a Visual Basic project by way of the Tools | References menu item.

 

OLE Automation Speech Recognition Services

The OLE Automation speech recognition services are implemented using two objects:

The OLE Voice Command object
The OLE Voice Menu object

The OLE Voice Command object has three properties and two methods. Table 15.6 shows the Voice Command object's properties and methods, along with their parameters and short descriptions.

Table 15.6. The properties and methods of the OLE Voice Command object.

Property/Method Name Parameters Description
Register method   This method is used to register the application with the SR engine. It must be called before any speech recognition will occur.
CallBack property Project.Class as string Visual Basic 4.0 programs can use this property to identify an existing class module that has two special methods defined. (See the following section, "Using the Voice Command Callback.")
Awake property TRUE/FALSE Use this property to turn on or off speech recognition for the application.
CommandSpoken property cmdNum as integer Use this property to determine which command was heard by the SR engine. VB4 applications do not need to use this property if they have installed the callback routines described earlier. All other programming environments must poll this value (using a timer) to determine the command that has been spoken.
MenuCreate method appName as String,
state as String,
langID as Integer,
dialect as String,
flags as Long
Use this method to create a new menu object. Menu objects are used to add new items to the list of valid commands to be recognized by the SR engine.

 

Using the Voice Command Callback

The Voice Command type library provides a unique and very efficient method for registering callbacks using a Visual Basic 4.0 class module. To establish an automatic notification from the SR engine, all you need to do is add a VB4 class module to your application. This class module must define two functions:

CommandRecognize-This event is fired each time the SR engine recognizes a command that belongs to your application's list.
CommandOther-This event is fired each time the SR engine receives spoken input it cannot understand.

Listing 15.1 shows how these two routines look in a class module.

Listing 15.1. Creating the notification routines for the Voice Command object.

'Sent when a spoken phrase was either recognized as being from another
'application's command set or was not recognized.
Function CommandOther(pszCommand As String, pszApp As String, pszState As String)
    If Len(pszCommand) = 0 Then
        VcintrForm.StatusMsg.Text = "Command unrecognized" & Chr(13) & Chr(10) & _
            VcintrForm.StatusMsg.Text
    Else
        VcintrForm.StatusMsg.Text = pszCommand & " was recognized from " & pszApp & _
            "'s " & pszState & " menu" & Chr(13) & Chr(10) & VcintrForm.StatusMsg.Text
    End If
End Function

'Sent when a spoken phrase is recognized as being from the application's
'command set.
Function CommandRecognize(pszCommand As String, dwID As Long)
    VcintrForm.StatusMsg.Text = pszCommand & Chr(13) & Chr(10) & _
        VcintrForm.StatusMsg.Text
End Function

 

The Voice Menu Object

The OLE Voice Menu object is used to add new commands to the list of valid items that can be recognized by the SR engine. The Voice Menu object has two properties and three methods. Table 15.7 shows the Voice Menu object's methods and properties, along with parameters and short descriptions.

Table 15.7. The properties and methods of the OLE Voice Menu object.

Property/Method Parameters Description
hWndMenu property hWnd as long Sets the window handle for a voice menu. Whenever this window is the foreground window, the voice menu is automatically activated; otherwise, it is deactivated. If this property is set to NULL, the menu is global.
Active property TRUE/FALSE Use this to turn the menu on or off. If this is set to TRUE, the menu is active. The menu must be active before its commands will be recognized by the SR engine.
Add method id as Long,
command as String,
category as String,
description as String
Adds a new command to the menu. The command parameter contains the actual phrase the SR engine will listen for. The id parameter is returned when the SR engine recognizes that the command has been spoken. The other parameters are optional.
Remove method id as Long Removes an item from the menu list. The id parameter is the same value used to create the item in the Add method.
ListSet method Name as String,
Elements as Long,
Data as String
Adds a list of possible entries for use with a command (see "Using Command Lists with the Voice Menu Object" later in this section). Name is the name of the list referred to in a command. Elements is the total number of elements in this list. Data is the set of elements, each separated by a Chr(0).

 

Using Command Lists with the Voice Menu Object

The Voice Menu object allows you to define a command that refers to a list. You can then load this list into the grammar using the ListSet method. For example, you can use the Add method to create a command to send e-mail messages. Then you can use the ListSet method to create a list of people to receive e-mail.

Listing 15.2. Using the Add and ListSet methods of the Voice Menu object.

Dim Names
Dim szNULL As String
szNULL = Chr(0)

Call vMenu.Add(109, "Send email to <Names>")
Names = "Larry" & szNULL & "Mike" & szNULL & "Gib" & szNULL & "Doug" & szNULL & _
    "George" & szNULL
Call vMenu.ListSet("Names", 5, Names)

OLE Automation Text-to-Speech Services

You can gain access to the OLE Automation TTS services through only one object: the Voice Text object. The Voice Text object has four properties and seven methods. Table 15.8 shows the properties and methods, along with their parameters and short descriptions.

Table 15.8. The properties and methods of the Voice Text object.

Property/Method Parameters Description
Register method AppName as string Used to register the application with the TTS engine. This must be called before any other methods are called.
Callback property Project.Class as string This property is used to establish a callback interface between the Voice Text object and your program. See "Using the Voice Text Callback" later in this section.
Enabled property TRUE/FALSE Use this property to turn the TTS service on or off. This must be set to TRUE for the Voice Text object to speak text.
Speed property lSpeed as Long Setting this value controls the speed (in words per minute) at which text is spoken. Setting the value to 0 sets the slowest speed. Setting the value to -1 sets the fastest speed.
IsSpeaking property TRUE/FALSE Indicates whether the TTS engine is currently speaking text. You can poll this read-only property to determine when the TTS engine is busy or idle. Note that VB4 programmers should use the Callback property instead of this property.
Speak method cText as string,
lFlags as Long
Use this method to get the TTS engine to speak text. The lFlags parameter can contain a value to indicate this is a statement, question, and so on.
StopSpeaking method (none) Use this method to force the TTS engine to stop speaking the current text.
AudioPause (none) Use this method to pause all TTS activity. This affects all applications using TTS services at this site (PC).
AudioResume (none) Use this method to resume TTS activity after calling AudioPause. This affects all applications using TTS services at this site (PC).
AudioRewind (none) Use this method to back up the TTS playback approximately one phrase or sentence.
AudioFastForward (none) Use this method to advance the TTS engine approximately one phrase or sentence.

 

Using the Voice Text Callback

The Voice Text type library provides a unique and very efficient method for registering callbacks using a Visual Basic 4.0 class module. To establish an automatic notification from the TTS engine, all you need to do is add a VB4 class module to your application. This class module must define two functions:

SpeakingStarted-This event is fired each time the TTS engine begins speaking text.
SpeakingDone-This event is fired each time the TTS engine stops speaking text.

Listing 15.3. Creating the notification routines for a Voice Text object.

Function SpeakingDone()
    VtintrForm.StatusMsg.Text = "Speaking Done notification" & Chr(13) & Chr(10) & _
        VtintrForm.StatusMsg.Text
End Function

Function SpeakingStarted()
    VtintrForm.StatusMsg.Text = "Speaking Started notification" & Chr(13) & Chr(10) & _
        VtintrForm.StatusMsg.Text
End Function

Only VB4 applications can use this method of establishing callbacks through class modules. If you are using the TTS objects with other VBA-compatible languages, you need to set up a routine, using a timer, that regularly polls the IsSpeaking property. The IsSpeaking property is set to TRUE while the TTS engine is speaking text.


 

 

Copyright © 2001 Engineered Station
Last modified: July 06, 2001
