Introduction

The Speech API is implemented as a series of Component Object Model (COM) interfaces. This section identifies the top-level objects, their child objects, and their methods. The SAPI model is divided into two distinct levels:

- High-level SAPI, which provides simplified voice command (speech recognition) and voice text (text-to-speech) services
- Low-level SAPI, which provides direct, fine-grained access to the speech recognition and text-to-speech engines
Each of the two levels of SAPI services has its own set of objects and methods. Along with the two sets of COM interfaces, Microsoft has also published an OLE Automation type library for the high-level SAPI objects. This set of OLE objects is discussed at the end of the section. By the time you complete this section, you'll understand the basic architecture of the SAPI model, including all the SAPI objects and their uses. Detailed information about each object's methods and parameters is covered in the next section, "SAPI Basics."
High-Level SAPI

The high-level SAPI services provide access to basic forms of speech recognition and text-to-speech services. This level is ideal for providing voice-activated menus, command buttons, and so on. It is also sufficient for basic rendering of text into speech. The high-level SAPI interface has two top-level objects: one for voice command services (speech recognition) and one for voice text services (text-to-speech). The following two sections describe each of these top-level objects, their child objects, and the interfaces available through each object.

Voice Command

The Voice Command object is used to provide speech recognition services. It is useful for providing simple command-and-control speech services such as implementing menu options, activating command buttons, and issuing other simple operating system commands. The Voice Command object has one child object and one collection object. The child object is the Voice Menu object, and the collection object is a collection of enumerated menu objects (see Figure 15.1).
Voice Command Object

The Voice Command object supports three interfaces:

- Voice Command
- Attributes
- Dialogs
The Voice Command interface is used to enumerate, create, and delete voice menu objects. This interface is also used to register an application to use the SR engine; an application must successfully complete the registration before the SR engine can be used. The Voice Command interface also defines the Mimic method, which plays a voice command back to the engine. It can be used to "speak" voice commands directly to the SR engine, much like playing keystroke or mouse-action macros back to the operating system.

The Attributes interface is used to set and retrieve a number of basic parameters that control the behavior of the voice command system. You can enable or disable voice commands, adjust input gain, establish the SR mode, and control the input device (microphone or telephone).

The Dialogs interface gives you access to a series of dialog boxes that can be used as a standard set of input screens for setting and displaying SR engine information. The SAPI model identifies five different dialog boxes that should be available through the Dialogs interface. The exact layout and content of these dialog boxes is not dictated by Microsoft, but is determined by the developer of the speech recognition engine. However, Microsoft has established general guidelines for the contents of the SR engine dialog boxes. Table 15.1 lists each of the five defined dialog boxes along with short descriptions of their suggested contents.

Table 15.1. The Voice Command dialog boxes.
The Voice Menu Object and the Menu Object Collection

The Voice Menu object is the only child object of the Voice Command object. It allows applications to define, add, and delete voice commands in a menu. You can also use the Voice Menu object to activate and deactivate menus and, optionally, to provide a training dialog box for the menu. The voice menu collection object contains a set of all menu objects defined in the voice command database. Microsoft SAPI defines functions to select and copy menu collections for use by the voice command speech engine.

The Voice Command Notification Callback

In the process of registering the application to use a voice command object, a notification callback (or sink) is established. This callback receives messages regarding SR engine activity. Typical messages sent out by the SR engine include notifications that the engine has detected commands being spoken, that some attribute of the engine has been changed, or that spoken commands have been heard but not recognized.
Voice Text

The SAPI model defines a basic text-to-speech service called voice text. This service has only one object: the Voice Text object. The Voice Text object supports three interfaces:

- Voice Text
- Attributes
- Dialogs
The Voice Text interface is the primary interface of the TTS portion of the high-level SAPI model. It provides a set of methods to start, pause, resume, fast-forward, rewind, and stop the TTS engine while it is speaking text, mirroring the VCR-type controls commonly employed for PC video and audio playback. The Voice Text interface is also used to register the application that will request TTS services; an application must successfully complete the registration before the TTS engine can be used. The registration function can optionally pass a pointer to a callback function to be used to capture voice text messages. This establishes a notification callback with several methods, which are triggered by messages sent from the underlying TTS engine.
The Attributes interface provides access to settings that control the basic behavior of the TTS engine. For example, you can use the Attributes interface to set the audio device to be used, set the playback speed (in words per minute), and turn the speech services on and off. If the TTS engine supports it, you can also use the Attributes interface to select the TTS speaking mode. The TTS speaking mode usually refers to a predefined set of voices, each having its own character or style (for example, male, female, child, adult, and so on).

The Dialogs interface lets users set and retrieve information regarding the TTS engine. The exact contents and layout of the dialog boxes are not determined by Microsoft but by the TTS engine developer. Microsoft does, however, suggest the possible contents of each dialog box. Table 15.2 shows the four voice text dialog boxes defined by the SAPI model, along with short descriptions of their suggested contents.

Table 15.2. The Voice Text dialog boxes.
Low-Level SAPI

The low-level SAPI services provide access to a much greater level of control over Windows speech recognition and text-to-speech services. This level is best for implementing advanced SR and TTS services, including the creation of dictation systems. Just as there are two basic service types for high-level SAPI, there are two primary COM interfaces defined for low-level SAPI: one for speech recognition and one for text-to-speech services. The rest of this section outlines each of the objects and their interfaces.
Speech Recognition

The Speech Recognition object has several child objects and collections. There are two top-level objects in the SR system: the SR Engine Enumerator object and the SR Sharing object. These two objects are created using their unique CLSID (class ID) values. The purpose of both objects is to give an application information about the available speech recognition engines and to allow the application to register with the appropriate engine. Once the engine is selected, one or more grammar objects can be created, and an SR Results object is created for each phrase the speech recognition engine hears. The SR Results object is a temporary object that contains details about the captured phrase. Figure 15.2 shows how the different objects relate to each other and how they are created.
When an SR engine is created, a link to a valid audio input device is also created. While it is possible to create a custom audio input device, it is not required. The default audio input device is an attached microphone, but the input can also be set to point to a telephone device. The rest of this section details the low-level SAPI SR objects and their interfaces.

The SR Enumerator and Engine Enumerator Objects

The role of the SR Enumerator and Engine Enumerator objects is to locate and select an appropriate SR engine for the requesting application. The Enumerator object lists all available speech recognition modes and their associated installed engines. This information is supplied by the child object of the Enumerator object: the Engine Enumerator object. The result of this search is a pointer to the SR engine interface that best meets the service request. The Enumerator and Engine Enumerator objects support only two interfaces.
The SR Sharing Object

The SR Sharing object is a possible replacement for the SR Enumerator and Engine Enumerator objects. The SR Sharing object uses only one interface, the ISRSharing interface, to locate and select an engine object that will be shared with other applications on the PC. In essence, this allows a requesting application to register with an out-of-process SR server object. While often slower than creating an instance of a private SR object, using the Sharing object can reduce strain on memory resources. The SR Sharing interface is an optional feature of speech engines and may not be available, depending on the design of the engine itself.

The SR Engine Object

The SR Engine object is the heart of the speech recognition system. This object represents the actual speech engine, and it supports several interfaces for monitoring speech activity. The SR Engine object is created using the Select method of the ISREnum interface of the SR Enumerator object described earlier. Table 15.3 lists the interfaces supported by the SR Engine object along with short descriptions of their uses.

Table 15.3. The interfaces of the SR Engine object.
The SR Engine object also provides a notification callback interface (ISRNotifySink) to capture messages sent by the engine. These messages can be used to check on the performance status of the engine, and they can provide feedback to the application (or speaker) that can be used to improve performance.

The Grammar Object

The Grammar object is a child object of the SR Engine object. It is used to load parsing grammars for use by the speech engine in analyzing audio input. The Grammar object contains all the rules, words, lists, and other parameters that control how the SR engine interprets human speech. Each phrase detected by the SR engine is processed using the loaded grammars. The Grammar object supports three interfaces.
The Grammar object also supports a notification callback to handle messages regarding grammar events. Optionally, the Grammar object can create an SR Results object, which is discussed fully in the next section.

The SR Results Object

The SR Results object contains detailed information about the most recent speech recognition event. This can include a recorded representation of the speech, the interpreted phrase constructed by the engine, the name of the speaker, performance statistics, and so on. Table 15.4 shows the interfaces defined for the SR Results object, along with descriptions of their use. Only the first interface in the table (the ISRResBasic interface) is required.

Table 15.4. The defined interfaces for the SR Results object.
Text-to-Speech

The low-level text-to-speech services are provided by one primary object: the TTS Engine object. Like the SR object set, the TTS object set has an Enumerator object and an Engine Enumerator object. These objects are used to locate and select a valid TTS Engine object and are then discarded.
The TTS services also use an audio output object. The default output device is the PC speakers, but output can also be directed to a telephone device. Applications can also create their own output devices, including a WAV-format recording device that captures TTS engine output. The rest of this section discusses the details of the low-level SAPI TTS objects.

The TTS Enumerator and Engine Enumerator Objects

The TTS Enumerator and Engine Enumerator objects are used to obtain a list of the available TTS engines and their speaking modes. They both support two interfaces.
Once the objects have provided a valid address for a TTS Engine object, the TTS Enumerator and Engine Enumerator objects can be discarded.

The TTS Engine Object

The TTS Engine object is the primary object of the low-level SAPI TTS services. The Engine object supports several interfaces. Table 15.5 lists the interfaces used for the translation of text into audible speech.

Table 15.5. The TTS Engine object interfaces.
In addition to the interfaces described in Table 15.5, the TTS Engine object supports two notification callbacks.
Speech Objects and OLE Automation

Microsoft supplies an OLE Automation type library with the Speech SDK. This type library can be used with any VBA-compliant software, including Visual Basic, Access, Excel, and others. The OLE Automation set provides high-level SAPI services only. The objects, properties, and methods are quite similar to the objects and interfaces provided by the high-level SAPI services described at the beginning of this section. There are two type library files in the Microsoft Speech SDK: one for voice command services and one for voice text services.
You can load these libraries into a Visual Basic project by way of the Tools | References menu item.
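As a minimal getting-started sketch, the fragment below creates the two top-level automation objects using late binding. The ProgID strings shown are assumptions for illustration only; use the Object Browser to confirm the names registered by the type libraries on your system.

    ' Module-level variables to hold the two automation objects.
    Dim vCmd As Object   ' OLE Voice Command object
    Dim vTxt As Object   ' OLE Voice Text object

    Sub InitSpeechObjects()
        ' ProgIDs below are assumed -- confirm them in the Object Browser.
        Set vCmd = CreateObject("Speech.VoiceCommand")
        Set vTxt = CreateObject("Speech.VoiceText")
    End Sub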
OLE Automation Speech Recognition Services

The OLE Automation speech recognition services are implemented using two objects:

- The Voice Command object
- The Voice Menu object
The OLE Voice Command object has three properties and two methods. Table 15.6 shows the Voice Command object's properties and methods, along with their parameters and short descriptions.

Table 15.6. The properties and methods of the OLE Voice Command object.
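As a hedged sketch of typical startup code, the fragment below registers the application with the speech engine and creates a menu to hold voice commands, using the vCmd object from the earlier fragment. The Register and MenuCreate names and their argument lists are assumptions; Table 15.6 is the authoritative reference.

    ' Register the application, then create a menu for voice commands.
    ' Method names and arguments are assumed -- verify against Table 15.6.
    Dim vMenu As Object
    vCmd.Register ""                                   ' "" = default speech site (assumed)
    Set vMenu = vCmd.MenuCreate("MyApp", "Main Menu")  ' assumed signature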
Using the Voice Command Callback

The Voice Command type library provides a unique and very efficient method for registering callbacks using a Visual Basic 4.0 class module. To establish an automatic notification from the SR engine, all you need to do is add a VB4 class module to your application. This class module must implement two routines: one that is called when a voice command is recognized, and one that is called for all other engine notifications. The listing below sketches how these two routines might look in a class module.
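The routine names (CommandRecognize and CommandOther) and their parameters in this sketch are assumptions for illustration; the exact names and signatures the engine expects are published in the type library.

    ' Class module (for example, named VCmdNotify).
    ' Routine names and parameters are assumed -- check the type library.

    Public Sub CommandRecognize(ByVal CmdName As String, ByVal ID As Long)
        ' Called when the SR engine recognizes one of the menu commands.
        Debug.Print "Recognized: " & CmdName
    End Sub

    Public Sub CommandOther(ByVal Phrase As String)
        ' Called for other notifications, such as speech heard but not recognized.
        Debug.Print "Not recognized: " & Phrase
    End Sub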
The Voice Menu Object

The OLE Voice Menu object is used to add new commands to the list of valid items that can be recognized by the SR engine. The Voice Menu object has two properties and three methods. Table 15.7 shows the Voice Menu object's methods and properties, along with parameters and short descriptions.

Table 15.7. The properties and methods of the OLE Voice Menu object.
Using Command Lists with the Voice Menu Object

The Voice Menu object allows you to define a command that refers to a list. You can then load this list into the grammar using the ListSet method. For example, you can use the Add method to create a command to send e-mail messages and then use the ListSet method to create the list of people who can receive e-mail. A minimal sketch of this pattern appears below.
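This fragment uses the vMenu object created earlier. The argument lists for Add and ListSet are assumptions, so treat it as a sketch to check against Table 15.7 rather than a working recipe.

    ' Define a command containing a list placeholder, then fill the list.
    ' Argument lists for Add and ListSet are assumed -- see Table 15.7.
    vMenu.Add 1, "Send mail to <name>", "Mail", "Sends an e-mail message"
    vMenu.ListSet "name", 3, "Bob" & vbNullChar & "Mary" & vbNullChar & "Amy"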
OLE Automation Text-to-Speech Services

You can gain access to the OLE Automation TTS services using only one object: the Voice Text object. The Voice Text object has four properties and seven methods. Table 15.8 shows the properties and methods, along with their parameters and short descriptions.

Table 15.8. The properties and methods of the Voice Text object.
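As a quick illustration, the fragment below registers the application and hands a sentence to the TTS engine through the vTxt object created earlier. The Register and Speak names and arguments are assumptions; Table 15.8 holds the authoritative list.

    ' Register for TTS services, then speak a line of text.
    ' Method names and arguments are assumed -- verify against Table 15.8.
    vTxt.Register "", "MyApp"               ' assumed: speech site, application name
    vTxt.Speak "Welcome to the Speech API"  ' assumed signature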
Using the Voice Text Callback

The Voice Text type library provides a unique and very efficient method for registering callbacks using a Visual Basic 4.0 class module. To establish an automatic notification from the TTS engine, all you need to do is add a VB4 class module to your application. This class module must implement two notification routines, which the TTS engine calls as it processes text.
Only VB4 applications can use this method of establishing callbacks through class modules. If you are using the TTS objects with other VBA-compatible languages, you need to set up a routine, using a timer, that regularly polls the IsSpeaking property. The IsSpeaking property is set to TRUE while the TTS engine is speaking text. A minimal sketch of this polling approach follows.
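This sketch assumes a form with a Timer control named tmrPoll and a label named lblStatus; the IsSpeaking property comes from the text above, while the control names and interval are illustrative.

    ' Poll the TTS engine from a Timer control (tmrPoll, Interval = 500 ms).
    ' IsSpeaking is described above; the control names here are illustrative.
    Private Sub tmrPoll_Timer()
        If vTxt.IsSpeaking Then
            lblStatus.Caption = "Speaking..."
        Else
            lblStatus.Caption = "Idle"
        End If
    End Sub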