BASICS

SAPI Hardware

Speech systems can be resource intensive. It is especially important that SR engines have enough RAM and disk space to respond quickly to user requests. Failure to respond quickly results in additional commands being spoken into the system, creating a spiraling degradation in performance: the worse things get, the worse things get. It doesn't take much of this before users decide our software is more trouble than it's worth!

Text-to-speech engines can also tax the system. While TTS engines do not always require a great deal of memory to operate, insufficient processor speed can result in halting or unintelligible playback of text.

For these reasons, it is important to establish clear hardware and software requirements when designing and implementing our speech-aware and speech-enabled applications. Not all PCs will have the memory, disk space, and hardware needed to properly implement SR and TTS services. There are three general categories of workstation resources that should be reviewed:

General hardware, including processor speed and RAM
Software, including operating system and SR/TTS engines
Special hardware, including sound cards, microphones, speakers, and headphones

The following three sections provide some general guidelines to follow when establishing minimal resource requirements for our applications.

General Hardware Requirements

Speech systems can tax processor and RAM resources. SR services require varying levels of resources depending on the type of SR engine installed and the level of services implemented. TTS engine requirements are rather stable, but also depend on the TTS engine installed.

The SR and TTS engines currently available for SAPI systems can usually be implemented successfully with as little as a 486/33 processor and an additional 1MB of RAM. However, overall PC performance with this configuration is poor, and it is not recommended. A better baseline is a Pentium processor (P60 or faster) with at least 16MB of total RAM. Systems that will support dictation SR services require the most computational power; it is not unreasonable to expect such a workstation to use 32MB of RAM and a P100 or faster processor. Obviously, the more resources, the better the performance.

SR Processor and Memory Requirements

In general, SR systems that implement command and control services need only an additional 1MB of RAM (not counting the application's own RAM requirement). Dictation services should get at least another 8MB of RAM, preferably more. The type of speech sampling and analysis and the size of the recognition vocabulary all affect the minimal resource requirements. Table 16.1 shows published minimal processor and RAM requirements for speech-recognition services.

Table 16.1. Published minimal processor and RAM requirements of SR services.

Levels of Speech-Recognition Services                          Minimal Processor   Minimal Additional RAM

Discrete, speaker-dependent, whole word, small vocabulary      386/16              64K
Discrete, speaker-independent, whole word, small vocabulary    386/33              256K
Continuous, speaker-independent, sub-word, small vocabulary    486/33              1MB
Discrete, speaker-dependent, whole word, large vocabulary      Pentium             8MB
Continuous, speaker-independent, sub-word, large vocabulary    RISC processor      8MB
 

These memory requirements are in addition to the requirements of the operating system and any loaded applications. For Windows 95, the minimum is 12MB of RAM, with 16MB recommended and 24MB preferred. For Windows NT, the minimum is 16MB, with 24MB recommended and 32MB preferred.
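
As a rough sketch of how an application might check this at run time, the following Win32 fragment uses GlobalMemoryStatus to compare installed RAM against the 16MB recommendation above. The threshold is only a guideline drawn from the figures in this section and can be adjusted:

    #include <windows.h>

    // Rough pre-flight check before enabling speech services.
    // The 16MB figure mirrors the recommendation above; adjust as needed.
    BOOL MeetsSpeechMemoryGuideline(void)
    {
        MEMORYSTATUS ms;
        ms.dwLength = sizeof(ms);
        GlobalMemoryStatus(&ms);            // available on Windows 95 and NT

        const DWORD requiredBytes = 16UL * 1024UL * 1024UL;   // 16MB of physical RAM
        return (ms.dwTotalPhys >= requiredBytes);
    }

If the check fails, the application can still run, but it should offer to start with speech services disabled.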

TTS Processor and Memory Requirements

TTS engines do not place as much of a demand on workstation resources as SR engines. Usually, TTS services require only a 486/33 processor and 1MB of additional RAM. The TTS programs themselves are rather small, about 150K. However, the grammar and prosody rules can demand as much as another 1MB, depending on the complexity of the language being spoken. It is interesting to note that probably the most complex and demanding language for TTS processing is English, primarily because of the language's irregular spelling patterns.

Most TTS engines use speech synthesis to produce the audio output. However, advanced systems can use diphone concatenation. Since diphone-based systems rely on a set of actual voice samples for reproducing written text, these systems can require an additional 1MB of RAM. To be safe, it is a good idea to suggest a requirement of 2MB of additional RAM, with a recommendation of 4MB for advanced TTS systems.

Software Requirements: Operating Systems and Speech Engines

The general software requirements are rather simple. The Microsoft Speech API can only be implemented on Windows 32-bit operating systems. This means you'll need Windows 95 or Windows NT 3.5 or greater on the workstation.
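
A minimal sketch of verifying the operating system at startup uses the standard GetVersionEx call; the constants below simply encode the Windows 95 / Windows NT 3.5-or-later requirement stated above:

    #include <windows.h>

    // Verify that the host OS is Windows 95 or Windows NT 3.5 or later,
    // the platforms on which the Speech API can be installed.
    BOOL IsSpeechCapableWindows(void)
    {
        OSVERSIONINFO vi;
        vi.dwOSVersionInfoSize = sizeof(vi);
        if (!GetVersionEx(&vi))
            return FALSE;

        if (vi.dwPlatformId == VER_PLATFORM_WIN32_WINDOWS)   // Windows 95
            return TRUE;

        if (vi.dwPlatformId == VER_PLATFORM_WIN32_NT)        // Windows NT 3.5+
            return (vi.dwMajorVersion > 3) ||
                   (vi.dwMajorVersion == 3 && vi.dwMinorVersion >= 50);

        return FALSE;
    }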

 

Note

All the testing and programming examples covered in this book have been performed using Windows 95. It is assumed that Windows NT systems will not require any additional modifications.

 

The most important software requirements for implementing speech services are the SR and TTS engines. An SR/TTS engine is the back-end processing module in the SAPI model. Our application is the front end, and SPEECH.DLL acts as the broker between the two processes.
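
Because the SAPI components are COM objects, the application never loads the engine DLL directly; it asks COM for an interface and lets the broker locate the engine. The sketch below shows only that general COM pattern. CLSID_ExampleSREngine is a placeholder GUID, not a real SAPI identifier; the actual class and interface IDs come from the Speech SDK headers.

    #include <windows.h>
    #include <objbase.h>

    // Placeholder GUID -- replace with the engine CLSID declared in the
    // Speech SDK headers. It is deliberately zeroed here.
    static const GUID CLSID_ExampleSREngine =
        { 0x00000000, 0x0000, 0x0000, { 0, 0, 0, 0, 0, 0, 0, 0 } };

    // General COM pattern used by SAPI applications; engine-specific
    // interfaces would be queried from the returned object.
    void SketchOfEngineCreation(void)
    {
        if (FAILED(CoInitialize(NULL)))
            return;

        IUnknown *pUnk = NULL;
        HRESULT hr = CoCreateInstance(CLSID_ExampleSREngine,   // placeholder CLSID
                                      NULL,
                                      CLSCTX_INPROC_SERVER,    // engine loaded in-process
                                      IID_IUnknown,
                                      (void **)&pUnk);
        if (SUCCEEDED(hr))
        {
            // Query for the engine-specific interface here, register any
            // notification sinks, then release when finished.
            pUnk->Release();
        }

        CoUninitialize();
    }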

New multimedia PCs usually include SR/TTS engines as part of their initial software package. For existing PCs, most sound cards now ship with SR/TTS engines.

Microsoft's Speech SDK does not include a set of SR/TTS engines. However, Microsoft does have an engine on the market: its Microsoft Phone software system (available as part of modem/sound card packages) includes the Microsoft Voice SR/TTS engine. We can also purchase engines directly from third-party vendors.

 

Note

Refer to Appendix B, "SAPI Resources," for a list of vendors that support the Speech API. We can also check the CD-ROM that ships with this book for the most recent list of SAPI vendors. Finally, the Microsoft Speech SDK contains a list of SAPI engine providers in the ENGINE.DOC file.

Special Hardware Requirements: Sound Cards, Microphones, and Speakers

Complete speech-capable workstations need three additional pieces of hardware:

A sound card for audio reproduction
Speakers for audio playback
A microphone for audio input

Just about any sound card can support SR/TTS engines. Cards from any of the major vendors are acceptable, including Sound Blaster and its compatibles, Media Vision, ESS Technology, and others. Any card that is compatible with Microsoft's Windows Sound System is also acceptable.

Many vendors now offer multifunction cards that provide speech, data, fax, and telephony services all in one card. We can usually purchase one of these cards for about $250-$500. By installing one of these cards, we can upgrade a workstation and reduce the number of hardware slots in use at the same time.

A few speech-recognition engines still need a DSP (digital signal processor) card. While it may be preferable to work with newer cards that do not require DSP handling, there are advantages to using DSP technology. DSP cards handle some of the computational work of interpreting speech input. This can actually reduce the resource requirements for providing SR services. In systems where speech is a vital source of process input, DSP cards can noticeably boost performance.

SR engines require the use of a microphone for audio input. This is usually handled by a directional microphone mounted on the PC base. Other options include a lavaliere microphone draped around the neck, or a headset microphone that includes headphones. Depending on the audio card installed, we may also be able to use a telephone handset for input.

Most multimedia systems ship with a suitable microphone built into the PC or as an external device that plugs into the sound card. It is also possible to purchase high-grade unidirectional microphones from audio retailers. Depending on the microphone and the sound card used, we may need an amplifier to boost the input to levels usable by the SR engine.

The quality of the audio input is one of the most important factors in the successful implementation of speech services on a PC. If the system will be used in a noisy environment, close-talk microphones should be used. This will reduce extraneous noise and improve the recognition capabilities of the SR engine.

Speakers or headphones are needed to play back TTS output. In private office spaces, free-standing speakers provide the best sound reproduction and the least risk of ear damage from high playback levels. However, in larger offices, or in areas where the playback can disturb others, headphones are preferred.

 

Tip

As mentioned earlier in this section, some systems can also provide audio playback through a telephone handset. Conversely, free-standing speakers and a microphone can be used together as a speakerphone system.

Technology Issues

As advanced as SR/TTS technology is, it still has its limits. This section covers the general technology issues for SR and TTS engines along with a quick summary of some of the limits of the process and how this can affect perceived performance and system design.

SR Techniques

Speech recognition technology can be measured by three factors:

Word selection
Speaker dependence
Word analysis

Word selection deals with the process of actually perceiving "word items" as input. Any speech engine must have some method for listening to the input stream and deciding when a word item has been uttered. There are three different methods for selecting words from the input stream. They are:

Discrete speech
Word spotting
Continuous speech

Discrete speech is the simplest form of word selection. Under discrete speech, the engine requires a slight pause between each word. This pause marks the beginning and end of each word item. Discrete speech requires the least amount of computational resources. However, discrete speech is not very natural for users. With a discrete speech system, users must speak in a halting voice. This may be adequate for short interactions with the speech system, but rather annoying for extended periods.

A much more preferred method of handling speech input is word spotting. Under word spotting, the speech engine listens for a list of key words along the input stream. This method allows users to use continuous speech. Since the system is "listening" only for key words, users do not need to insert unnatural pauses while they speak. The advantage of word spotting is that it gives users the perception that the system is actually listening to every word while limiting the amount of resources required by the engine itself. The disadvantage of word spotting is that the system can easily misinterpret input. For example, if the engine is spotting only the word run, it will treat the phrases "Run Excel" and "Run Access" as the same command. For this reason, it is important to design vocabularies for word-spotting systems that limit the possibility of confusion.
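
The following toy C++ sketch mimics word spotting on a text transcript rather than on audio; it is meant only to illustrate the vocabulary-design issue, not how an engine actually works. Note how crude the matching is: a plain substring test for "run" would also fire on "rerun."

    #include <cctype>
    #include <iostream>
    #include <string>
    #include <vector>

    // Toy word-spotting illustration: scan a text transcript for key words.
    // A real SR engine spots words in the audio stream, not in text.
    int main()
    {
        std::string utterance = "Please run Excel for me";
        for (std::string::size_type i = 0; i < utterance.size(); ++i)
            utterance[i] = (char)std::tolower((unsigned char)utterance[i]);

        std::vector<std::string> keywords;
        keywords.push_back("run");
        keywords.push_back("excel");
        keywords.push_back("access");

        for (std::vector<std::string>::size_type k = 0; k < keywords.size(); ++k)
            if (utterance.find(keywords[k]) != std::string::npos)
                std::cout << "spotted: " << keywords[k] << '\n';

        return 0;
    }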

The most advanced form of word selection is the continuous speech method. Under continuous speech, the SR engine attempts to recognize each word that is uttered in real time. This is the most resource-intensive of the word selection methods. For this reason, continuous speech is best reserved for dictation systems that require complete and accurate perception of every word.

The process of word selection can be affected by the speaker. Speaker dependence refers to the engine's ability to deal with different speakers. Systems can be speaker dependent, speaker independent, or speaker adaptive. The disadvantage of speaker-dependent systems is that they require extensive training by a single user before they become very accurate. This training can last as much as one hour before the system has an accuracy rate of over 90 percent. Another drawback to speaker-dependent systems is that each new user must re-train the system to reduce confusion and improve performance. However, speaker-dependent systems provide the greatest degree of accuracy while using the least amount of computing resources.

Speaker-adaptive systems are designed to perform adequately without training, but they improve with use. The advantage of speaker-adaptive systems is that users experience success without tedious training. Disadvantages include additional computing resource requirements and possible reduced performance on systems that must serve different people.

Speaker-independent systems provide accurate recognition without any user training. Speaker-independent systems are a must for installations where multiple speakers need to use the same station. The drawback of speaker-independent systems is that they require the greatest degree of computing resources.

Once a word item has been selected, it must be analyzed. Word analysis techniques involve matching the word item to a list of known words in the engine's vocabulary. There are two methods for handling word analysis: whole-word matching and sub-word matching. Under whole-word matching, the SR engine matches the word item against a vocabulary of complete word templates. The advantage of this method is that the engine is able to make an accurate match very quickly, without the need for a great deal of computing power. The disadvantage of whole-word matching is that it requires extremely large vocabularies, running into the tens of thousands of entries. Also, these words must be stored as spoken templates, and each word can require as much as 512 bytes of storage.

An alternate word-matching method involves the use of sub-words called phonemes. Each language has a fixed set of phonemes that are used to build all words. By informing the SR engine of the phonemes and their representations, it becomes much easier to recognize a wider range of words. Under sub-word matching, the engine does not require an extensive vocabulary. An additional advantage of sub-word systems is that the pronunciation of a word can be determined from printed text. Phoneme storage requires only 5 to 20 bytes per phoneme. The disadvantage of sub-word matching is that it requires more processing resources to analyze input.
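
A quick back-of-envelope comparison makes the storage difference concrete. The 20,000-word vocabulary and 45-phoneme figures below are illustrative assumptions; the per-item sizes come from the text above:

    #include <iostream>

    // Compare whole-word template storage against a phoneme set, using the
    // upper per-item figures quoted in the text.
    int main()
    {
        const long words         = 20000L;   // a modest dictation vocabulary (assumed)
        const long bytesPerWord  = 512L;     // whole-word template, upper figure
        const long phonemes      = 45L;      // approximate English phoneme count (assumed)
        const long bytesPerPhone = 20L;      // phoneme storage, upper figure

        std::cout << "whole-word templates: " << (words * bytesPerWord) / 1024L
                  << " KB\n";                // about 10,000 KB (roughly 10MB)
        std::cout << "phoneme set:          " << (phonemes * bytesPerPhone)
                  << " bytes\n";             // under 1 KB, plus the matching rules
        return 0;
    }

The phoneme inventory itself is tiny; the extra cost of sub-word matching is processing, not storage.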

SR Limits

It is important to understand the limits of current SR technology and how these limits affect system performance. Three of the most vital limitations of current SR technology are:

Speaker identification
Input recognition
Recognition accuracy

The first hurdle for SR engines is determining when the speaker is addressing the engine and when the words are directed to someone else in the room. This skill is beyond the SR systems currently on the market; our program must allow users to tell the computer when they are addressing the engine. Also, SR engines cannot distinguish between multiple speakers. With speaker-independent systems, this is not a big problem. However, speaker-dependent systems cannot deal well with situations in which multiple users may be addressing the same system.

Even speaker-independent systems can have a hard time when multiple speakers are involved. For example, a dictation system designed to transcribe a meeting will not be able to differentiate between speakers. Also, SR systems fail when two people are speaking at the same time.

SR engines also have limits regarding the processing of identified words. First, SR engines have no ability to process natural language. They can only recognize words in the existing vocabulary and process them based on known grammar rules. Thus, despite any perceived "friendliness" of speech-enabled systems, they do not really understand the speaker at all.

SR engines also are unable to hear a new word and derive its meaning from previously spoken words. The system is incapable of spelling or rendering words that are not already in its vocabulary.

Finally, SR engines are not able to deal with wide variations in pronunciation of the same word. For example, words such as either (ee-ther or I-ther) and potato (po-tay-toe or po-tah-toe) can easily confuse the system. Wide variations in pronunciation can greatly reduce the accuracy of SR systems.

Recognition accuracy can be affected by regional dialects, quality of the microphone, and the ambient noise level during a speech session. Much like the problem with pronunciation, dialect variations can hamper SR engine performance. If our software is implemented in a location where the common speech contains local slang or other region-specific words, these words may be misinterpreted or not recognized at all.

Poor microphones or noisy office spaces also affect accuracy. A system that works fine in a quiet, well-equipped office may be unusable in a noisy facility. In a noisy environment, the SR engine is more likely to confuse similar-sounding words such as out and pout, or in and when. For this reason it is important to emphasize the value of a good microphone and a quiet environment when performing SR activities.

TTS Techniques

TTS engines use two different techniques for turning text input into audio output: synthesis or diphone concatenation. Synthesis involves the creation of human speech through the use of stored phonemes. This method results in audio output that is understandable, but not very human-like. The advantage of synthesis systems is that they do not require a great deal of storage space and that they allow the voice quality to be modified by adjusting only a few parameters.

Diphone-based systems produce output that is much closer to human speech. This is because the system stores actual human speech phoneme sets and plays them back. The disadvantage of this method is that it requires more computing and storage capacity. However, if our application is used to provide long sessions of audio output, diphone systems produce speech that is much easier to understand.

TTS Limits

TTS engines are limited in their ability to re-create the details of spoken language, including rhythm, accent, and pitch inflection. This combination of properties is called the prosody of speech. TTS engines are not very good at adding prosody. For this reason, listening to TTS output can be difficult, especially for long periods of time. Most TTS engines allow users to edit text files with embedded control information that adds prosody to the ASCII text. This is useful for systems that are used to "read" text that is edited and stored for later retrieval.
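
For example, a program might build its output strings with embedded control sequences before handing them to the engine. The tag names in this sketch (\pitch, \pause, \emph) are invented placeholders, not actual SAPI or engine tags; each engine documents its own escape syntax.

    #include <string>

    // Illustrative only: \pitch, \pause, and \emph are made-up placeholder tags.
    // Substitute the escape sequences documented for the TTS engine in use.
    std::string BuildAnnouncement()
    {
        std::string text;
        text += "\\pitch=140\\ Status report complete. ";   // nudge the base pitch
        text += "\\pause=300\\ ";                            // brief pause (300 ms)
        text += "All operations are \\emph\\ within specifications \\emph\\ .";
        return text;                                         // hand the string to the engine
    }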

TTS systems have their limits when it comes to producing individualized voices. Synthesis-based engines are relatively easy to modify to create new voice types. This modification involves the adjustment of general pitch and speed to produce new vocal personalities such as "old man," "child," "female," "male," and so on. However, these voices still use the same prosody and grammar rules.

Creating new voices for diphone-based systems is much more costly than for synthesis-based systems. Since each new vocal personality must be assembled from pre-recorded human speech, it can take quite a bit of time and effort to alter an existing voice set or to produce a new one. Diphone concatenation is costly for systems that must support multiple languages or need to provide flexibility in voice personalities.

General SR Design Issues

There are a number of general issues to keep in mind when designing SR interfaces to our applications.

First, if we provide speech services within our application, we need to make sure the user knows the services are available. This can be done by adding a graphic image to the display telling the user that the computer is "listening," or by adding caption or status items that indicate the current state of the SR engine.

It is also a good idea to make speech services an optional feature whenever possible. Some installations may not have the hardware or RAM required to implement speech services. Even if the workstation has adequate resources, the user may experience performance degradation with the speech services active. It is a good idea to have a menu option or some other method that allows users to turn off speech services entirely.
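A minimal sketch of such a toggle follows, assuming an application-defined menu item (IDM_SPEECH) and placeholder start/stop routines; none of these names come from SAPI itself.

    #include <windows.h>

    #define IDM_SPEECH 101                    // application-defined menu ID (assumption)

    static BOOL g_speechActive = TRUE;

    static void EnableSpeech(void)  { /* resume listening and speaking */ }
    static void DisableSpeech(void) { /* suspend or release the engines */ }

    // Turn speech services on or off from a single menu selection and keep
    // the menu check mark in sync with the current state.
    void ToggleSpeechServices(HWND hwnd)
    {
        g_speechActive = !g_speechActive;

        CheckMenuItem(GetMenu(hwnd), IDM_SPEECH,
                      MF_BYCOMMAND | (g_speechActive ? MF_CHECKED : MF_UNCHECKED));

        if (g_speechActive)
            EnableSpeech();
        else
            DisableSpeech();
    }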

When we add speech services to our programs, it is important to give users realistic expectations regarding the capabilities of the installation. This is best done through user documentation. We needn't go into great length, but we should give users general information about the state of SR technology and make sure they do not expect to carry on extensive conversations with their new "talking electronic pal."

Along with indications that speech services are active, it is a good idea to provide users with a single speech command that displays a list of recognized speech inputs, along with some general online help regarding the use and capabilities of the SR services of our program. Since the total number of commands might be quite large, we may want to provide a voice-activated help system that allows users to query the current command set and then ask additional questions to learn more about the various speech commands they can use.

It is also a good idea to add confirmations to especially dangerous or ambiguous speech commands. For example, if we have a voice command for "Delete," we should ask the user to confirm this option before continuing. This is especially important if we have other commands that may sound similar. If we have both "Delete" and "Repeat" in the command list, we will want to make sure the system knows which command was requested.
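
A sketch of such a confirmation, shown once the SR engine reports the risky command, using a standard Windows message box:

    #include <windows.h>

    // Ask the user to confirm a recognized "Delete" command before acting on it.
    BOOL ConfirmSpokenDelete(HWND hwnd)
    {
        int answer = MessageBoxA(hwnd,
                                 "The command \"Delete\" was recognized.\n"
                                 "Do you want to delete the selected item?",
                                 "Confirm Voice Command",
                                 MB_YESNO | MB_ICONQUESTION);
        return (answer == IDYES);
    }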

In general, it is a good idea to display the status of all speech processing. If the system does not understand a command, it is important to tell users rather than making them sit idle while our program waits for understandable input. If the system cannot identify a command, display a message telling the user to repeat the command, or bring up a dialog box that lists likely possibilities from which the user can select the requested command.

In some situations, background noise can hamper the performance of the SR engine. It is advisable to allow users to turn off speech services and only turn them back on when they are needed. This can be handled through a single button press or menu selection. In this way, stray noise will not be misinterpreted as speech input.

There are a few things to avoid when adding voice commands to an application. SR systems are not very successful when processing long series of numbers or single letters. "M" and "N" sound quite alike, and long lists of digits can confuse most SR systems. Also, although SR systems are capable of handling requests such as "move mouse left," "move mouse right," and so on, this is not a good use of voice technology. Using voice commands to handle a pointer device is a bit like using the keyboard to play musical notes. It is possible, but not desirable.

Voice Command Menu Design

The key to designing good command menus is to make sure they are complete and consistent and that the commands within a set are unique. Good command menus also contain more than just the list of items displayed on the physical menu. It is a good idea to think of voice commands as we would keyboard shortcuts.

Useful voice command menus will provide access to all the common operations that might be performed by the user. For example, the standard menu might offer a top-level menu option of Help. Under the Help menu might be an About item to display the basic information about the loaded application. It makes sense to add a voice command that provides direct access to the About box with a Help About command.

These shortcut commands may span several menu levels or even stand independent of any existing menu. For example, in an application used to monitor the status of manufacturing operations within a plant, we might add a command such as Display Statistics that would gather data from several locations and present a graph onscreen.
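
One simple way to organize these shortcuts is a table that maps each recognized phrase to a handler; the dispatch code stays the same no matter how many commands are added. The phrases and handler names below are illustrative only:

    #include <map>
    #include <string>

    typedef void (*CommandHandler)(void);

    static void ShowAboutBox(void)   { /* open the Help | About dialog */ }
    static void ShowStatistics(void) { /* gather plant data and graph it */ }

    // Look up a recognized phrase and run its handler; alternate wordings can
    // simply map to the same handler.
    void DispatchVoiceCommand(const std::string &phrase)
    {
        static std::map<std::string, CommandHandler> commands;
        if (commands.empty())
        {
            commands["help about"]         = ShowAboutBox;
            commands["display statistics"] = ShowStatistics;
            commands["show statistics"]    = ShowStatistics;   // alias, same action
        }

        std::map<std::string, CommandHandler>::iterator it = commands.find(phrase);
        if (it != commands.end())
            it->second();             // run the matched handler
        // otherwise, report the unrecognized command to the user
    }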

When designing menus, be sure to include commands for all dialog boxes. It is not a good idea to provide voice commands for only some dialog boxes and not for others.

 

Tip

We do not have to create menu commands for Windows-supplied dialog boxes (the Common Dialogs, the Message Box, and so on). Windows automatically supplies voice commands for these dialogs.

 

Be sure to include voice commands for the list and combo boxes within a dialog box, as well as the command buttons, check boxes, and option buttons.

In addition to creating menus for all the dialog boxes of our applications, we should consider creating a "global" menu that is active as long as the application is running. This would allow users to execute common operations such as Get New Mail or Display Status Log without having to first bring the application into the foreground.

 

Tip

It is advisable to limit this use of speech services to only a few vital and unique commands since any other applications that have speech services may also activate global commands.

 

It is also important to include common alternate wordings for frequently used operations, such as Get New Mail and Check for New Mail. Although we may not be able to include all possible alternatives, adding a few will greatly improve the accessibility of our speech interface.

Use consistent word order in our menu design. For example, for action commands we should use the verb-noun construct, as in Save File or Check E-Mail. For questions, use a consistent preface such as How do I… or Help Me…, as in How do I check e-mail? or Help me change font. It is also important to be consistent with the use of singular and plural. In the preceding example, we must be sure to use Font or Fonts throughout the application.

Since the effectiveness of the SR engine is determined by its ability to match our voice input against a list of valid words, we can increase the accuracy of the SR engine by keeping the command lists relatively short. When a command is spoken, the engine scans the list of valid inputs for the current state and selects the most likely candidate. The more words on the list, the greater the chance the engine will select the wrong command. By limiting the list, we can increase the odds of a correct "hit."

Finally, we can greatly increase the accuracy of the SR engine by avoiding similar-sounding words in commands. For example, repeat and delete are dangerously similar. Other words that are easily confused are go and no, and even on and off. We can still use these words in our application if we use them in separate states. In other words, do not use repeat in the same set of menu options as delete.
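
A sketch of this idea: group commands into per-state sets so that only a short list is active at any time, and confusable words never compete with each other. The state names and command lists are illustrative assumptions.

    #include <map>
    #include <string>
    #include <vector>

    typedef std::vector<std::string> CommandList;

    // Build per-state command sets. Only the commands for the current state
    // are handed to the SR engine, so "repeat" never competes with "delete".
    std::map<std::string, CommandList> BuildCommandSets()
    {
        std::map<std::string, CommandList> sets;

        CommandList editing;
        editing.push_back("delete");
        editing.push_back("copy");
        editing.push_back("paste");
        sets["editing"] = editing;

        CommandList playback;
        playback.push_back("repeat");
        playback.push_back("stop");
        sets["playback"] = playback;

        return sets;
    }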

TTS Design Issues

There are a few things to keep in mind when adding text-to-speech services to our applications. First, make sure we design our application to offer TTS as an option, not as a required service. Our application may be installed on a workstation that does not have the required resources, or the user may decide to turn off TTS services to improve overall performance. For this reason, it is also important to provide visual as well as aural feedback for all major operations. For example, when processing is complete, it is a good idea to inform the user with a dialog box as well as a spoken message.

Because TTS engines typically produce a voice that is less than human-like, extended sessions of listening to TTS output can be tiring to users. It is a good idea to limit TTS output to short phrases. For example, if our application gathers status data on several production operations on the shop floor, it is better to have the program announce the completion of the process (for example, Status report complete) instead of announcing the details of the findings. Alternatively, our TTS application could announce a short summary of the data (for example, All operations on time and within specifications).

If our application must provide extended TTS sessions, we should consider using pre-recorded WAV files for output. For example, if our application gives users aural access to company regulations or documentation, it is better to record a person reading the documents and then play back these recordings to users on request. Also, if our application provides a limited set of vocal responses to the user, it is advisable to use WAV recordings instead of TTS output. A good example of this is a telephony application that asks users questions and responds with fixed answers.
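
Playing a pre-recorded clip is a one-line call with the Win32 multimedia API; the file name below is an assumption, and the program must be linked with winmm.lib.

    #include <windows.h>
    #include <mmsystem.h>                    // PlaySound; link with winmm.lib

    // Play a pre-recorded WAV file instead of generating long TTS output.
    // SND_ASYNC returns immediately while the clip plays in the background.
    void PlayRegulationsClip(void)
    {
        PlaySound(TEXT("regulations.wav"), NULL, SND_FILENAME | SND_ASYNC);
    }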

Finally, it is not advisable to mix WAV output and TTS output in the same session. This highlights the differences between the quality of recorded voice and computer-generated speech. Switching between WAV and TTS can also make it harder for users to understand the TTS voice since they may be expecting a familiar recorded voice and hear computer-generated TTS instead.
