
Speech Application Programming Interface:

One of the newest extensions for the Windows 95 operating system is the Speech Application Programming Interface (SAPI). This Windows extension gives workstations the ability to recognize human speech as input and to create human-like audio output from printed text. This ability adds a new dimension to human/PC interaction. Speech recognition services can be used to extend the use of PCs to those who find typing too difficult or too time-consuming. Text-to-speech services can provide aural renderings of text documents to those who cannot see typical display screens, whether because of physical limitations or the nature of their work.

Like the other Windows services described in this site, SAPI is part of the Windows Open Services Architecture (WOSA) model. Speech recognition (SR) and text-to-speech (TTS) services are actually provided by separate modules called engines. Users can select the speech engine they prefer to use as long as it conforms to the SAPI interface.

In this Section you'll learn the basic concepts behind designing and implementing a speech recognition and text-to-speech engine using the SAPI design model. You'll also learn about creating grammar definitions for speech recognition.

Speech Recognition

Any speech system has, at its heart, a process for recognizing human speech and turning it into something the computer understands. In effect, the computer needs a translator. Research into effective speech recognition algorithms and processing models has been going on almost ever since the computer was invented. And a great deal of mathematics and linguistics go into the design and implementation of a speech recognition system. A detailed discussion of speech recognition algorithms is beyond the scope of our project, but it is important to have a good idea of the commonly used techniques for turning human speech into something a computer understands.

Every speech recognition system uses four key operations to listen to and understand human speech. They are:

Word separation-This is the process of dividing human speech into discrete portions. Each portion can be as large as a phrase or as small as a single syllable or word part.

Vocabulary-This is the list of speech items that the speech engine can identify.

Word matching-This is the method that the speech system uses to look up a speech part in the system's vocabulary-the search engine portion of the system.

Speaker dependence-This is the degree to which the speech engine is dependent on the vocal tones and speaking patterns of individuals.

These four aspects of the speech system are closely interrelated. If we want to develop a speech system with a rich vocabulary, we'll need a sophisticated word matching system to quickly search the vocabulary. Also, as the vocabulary gets larger, more items in the list could sound similar (for example, yes and yet). In order to successfully identify these speech parts, the word separation portion of the system must be able to determine smaller and smaller differences between speech items.

Finally, the speech engine must balance all of these factors against the aspect of speaker dependence. As the speech system learns smaller and smaller differences between words, the system becomes more and more dependent on the speaking habits of a single user. Individual accents and speech patterns can confuse speech engines. In other words, as the system becomes more responsive to a single user, that same system becomes less able to translate the speech of other users.

The next few sections describe each of the four aspects of a speech engine in a bit more detail.

Word Separation

The first task of the speech engine is to accept words as input. Speech engines use a process called word separation to gather human speech. Just as the keyboard is used as an input device to accept physical keystrokes for translation into readable characters, the process of word separation accepts the sound of human speech for translation by the computer.

There are three basic methods of word separation. In ascending order of complexity they are:

Discrete speech

Word spotting

Continuous speech

Systems that use the discrete speech method of word separation require the user to place a short pause between each spoken word. This slight bit of silence allows the speech system to recognize the beginning and ending of each word. The silences separate the words much like the space bar does when we type. The advantage of the discrete speech method is that it requires the least amount of computational resources. The disadvantage of this method is that it is not very user-friendly. Discrete speech systems can easily become confused if a person does not pause between words.
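The discrete speech approach can be illustrated with a toy segmenter that treats a run of quiet samples as the pause between words. This is a minimal sketch over a plain list of amplitude values, not a real signal-processing pipeline; the threshold and gap length are arbitrary illustrative choices.

```python
def split_on_silence(samples, threshold=0.05, min_gap=3):
    """Split amplitude samples into word segments.

    A run of `min_gap` or more consecutive quiet samples
    (below `threshold`) counts as the pause between words.
    """
    words, current, quiet = [], [], 0
    for s in samples:
        if abs(s) < threshold:
            quiet += 1
        else:
            if quiet >= min_gap and current:
                # The pause ended the previous word.
                words.append(current)
                current = []
            quiet = 0
            current.append(s)
    if current:
        words.append(current)
    return words
```

Note how simple the logic is: the engine never has to guess where a word ends, which is exactly why discrete speech demands the least computation.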

Systems that use word spotting avoid the need for users to pause in between each word by listening only for key words or phrases. Word spotting systems, in effect, ignore the items they do not know or care about and act only on the words they can match in their vocabulary. For example, suppose the speech system can recognize the word help, and knows to load the Windows Help engine whenever it hears the word. Under word spotting, the following phrases will all result in the speech engine invoking Windows Help:

Please load Help.
Can you help me, please?
These definitions are no help at all!

As we can see, one of the disadvantages of word spotting is that the system can easily misinterpret the user's meaning. However, word spotting also has several key advantages. Word spotting allows users to speak normally, without employing pauses. Also, since word spotting systems simply ignore words they don't know and act only on key words, these systems can give the appearance of being more sophisticated than they really are. Word spotting requires more computing resources than discrete speech, but not as much as the last method of word separation-continuous speech.

Continuous speech systems recognize and process every word spoken. This gives the greatest degree of accuracy when attempting to understand a speaker's request. However, it also requires the greatest amount of computing power. First, the speech system must determine the start and end of each word without the use of silence. This is much like readingtextthathasnospacesinit (see!). Once the words have been separated, the system must look them up in the vocabulary and identify them. This, too, can take precious computing time. The primary advantage of continuous speech systems is that they offer the greatest level of sophistication in recognizing human speech. The primary disadvantage is the amount of computing resources they require.
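The readingtextthathasnospacesinit problem has a textual analogue that shows why continuous speech costs so much more computation: recovering word boundaries means testing the vocabulary at every possible split point. A small dynamic-programming sketch (illustrative only, with a hand-picked vocabulary):

```python
def segment(text, vocab):
    """Recover word boundaries from running text using a known
    vocabulary. Returns one valid split as a list, or None if the
    text cannot be split into vocabulary words.
    """
    n = len(text)
    best = [None] * (n + 1)  # best[i] = a valid split of text[:i]
    best[0] = []
    for i in range(1, n + 1):
        for j in range(i):
            # Try every cut point j; both halves must be resolvable.
            if best[j] is not None and text[j:i] in vocab:
                best[i] = best[j] + [text[j:i]]
                break
    return best[n]
```

Every character position multiplies the number of candidate look-ups, which mirrors the extra work a continuous speech engine must do without silences to guide it.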

Speaker Dependence

Speaker dependence is a key factor in the design and implementation of a speech recognition system. In theory, we would like a system that has very little speaker dependence. This would mean that the same workstation could be spoken to by several people with the same positive results. People often speak quite differently from one another, however, and this can cause problems.

First, there is the case of accents. Just using the United States as an example, we can identify several regional sounds. Add to these the possibility that speakers may also have accents that come from outside the U.S. due to the influence of other languages (Spanish, German, Japanese), and we have a wide range of pronunciation for even the simplest of sentences. Speaker speed and pitch inflection can also vary widely, which can pose problems for speech systems that need to determine whether a spoken phrase is a statement or a question.

Speech systems fall into three categories in terms of their speaker dependence. They can be:

Speaker independent

Speaker dependent

Speaker adaptive

Speaker-independent systems require the most resources. They must be able to accurately translate human speech across as many dialects and accents as possible. Speaker-dependent systems require the least amount of computing resources. These systems require that the user "train" the system before it is able to accurately convert human speech. A compromise between the two approaches is the speaker-adaptive method. Speaker-adaptive systems are prepared to work without training, but increase their accuracy after working with the same speaker for a period of time.

The additional training required by speaker-dependent systems can be frustrating to users. Usually training can take several hours, but some systems can reach 90 percent accuracy or better after just five minutes of training. Users with physical disabilities, or those who find typing highly inefficient, will be most likely to accept using speaker-dependent systems.

Systems that will be used by many different people need the power of speaker independence. This is especially true for systems that will have short encounters with many different people, such as greeting kiosks at an airport. In such situations, training is unlikely to occur, and a high degree of accuracy is expected right away.

For systems where multiple people will access the same workstation over a longer period of time, the speaker-adaptive system will work fine. A good example would be a workstation used by several employees to query information from a database. The initial investment spent training the speech system will pay off over time as the same staff uses the system.

Word Matching

Word matching is the process of performing look-ups into the speech database. As each word is gathered (using the word separation techniques described earlier), it must be matched against some item in the speech engine's database. It is the process of word matching that connects the audio input signal to a meaningful item in the speech engine database.

There are two primary methods of word matching:

Whole-word matching

Phoneme matching

Under whole-word matching, the speech engine searches the database for a word that matches the audio input. Whole-word matching requires a less sophisticated search capability than phoneme matching, but a greater amount of storage capacity. Under the whole-word matching model, the system must store a word template that represents each possible word that the engine can recognize. While quick retrieval makes whole-word matching attractive, the fact that all words must be known ahead of time limits the application of whole-word matching systems.

Phoneme matching systems keep a dictionary of language phonemes. Phonemes are the smallest unique sound part of a language, and can be numerous. For example, while the English language has 26 individual letters, these letters do not represent the total list of possible phonemes. Also, phonemes are not restricted by spelling conventions.

Consider the words Philip and fill up. These words have the same phonemes: f, eh, ul, ah, and pah. However, they have entirely different meanings. Under the whole-word matching model, these words could represent multiple entries in the database. Under the phoneme matching model, the same five phonemes can be used to represent both words.

As we may expect, phoneme matching systems require more computational resources, but less storage space.
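The trade-off can be sketched with a lookup table keyed by phoneme sequences, reusing the Philip/fill up transcription from above. A real engine derives phoneme sequences acoustically; the hand-built table here is purely illustrative.

```python
# Homophones share one phoneme-sequence entry, so the table stays
# small even as the word list grows (the storage advantage).
PHONEME_DICT = {
    ("f", "eh", "ul", "ah", "pah"): ["Philip", "fill up"],
}

def match_phonemes(phonemes, table=PHONEME_DICT):
    """Return every vocabulary entry sharing this phoneme sequence.

    The caller must use context to choose between homophones,
    which is where the extra computation goes.
    """
    return table.get(tuple(phonemes), [])
```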


Vocabulary

The final element of a speech recognition system is the vocabulary. There are two competing issues regarding vocabulary: size and accuracy. As the vocabulary size increases, recognition improves. With large vocabularies, it is easy for speech systems to locate a word that matches the one identified in the word separation phase. However, one of the reasons it is easy to find a match is that more than one entry in the vocabulary may match the given input. For example, the words no and go are very similar to most speech engines. Therefore, as vocabulary size grows, the accuracy of speech recognition can decrease.

Contrary to what we might assume, a speech engine's vocabulary does not represent the total number of words it understands. Instead, the vocabulary of a speech engine represents the number of words that it can recognize in a current state or moment in time. In effect, this is the total number of "unidentified" words that the system can resolve at any moment.

For example, let's assume we have registered the following word phrases with our speech engine: "Start running Exchange" and "Start running Word." Before we say anything, the current state of the speech engine has four words: start, running, Exchange, and Word. Once we say "Start running" there are only two words in the current state: Exchange and Word. The system's ability to keep track of the possible next word is determined by the size of its vocabulary.
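The current-state idea can be sketched directly: given the words heard so far, the engine only has to resolve the words that can still complete a registered phrase. This toy model uses the two phrases from the example above; it is not the SAPI API, just an illustration of the concept.

```python
PHRASES = ["Start running Exchange", "Start running Word"]

def current_state(spoken, phrases=PHRASES):
    """Return the words still unresolved in phrases that match
    the prefix heard so far - the engine's current-state vocabulary.
    """
    heard = spoken.split()
    remaining = set()
    for p in phrases:
        words = p.split()
        if words[:len(heard)] == heard:
            # This phrase is still possible; its unspoken words
            # stay in the current state.
            remaining.update(words[len(heard):])
    return sorted(remaining)
```

Before anything is spoken the state holds all four words; after "Start running" it shrinks to just the two program names, exactly as described above.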

Small vocabulary systems (100 words or less) work well in situations where most of the speech recognition is devoted to processing commands. However, we need a large vocabulary to handle dictation systems. Dictation vocabularies can reach into tens of thousands of words. This is one of the reasons that dictation systems are so difficult to implement: not only must the vocabulary be large, but word resolution must also happen very quickly.


Text-To-Speech

A second type of speech service provides the ability to convert written text into spoken words. This is called text-to-speech (or TTS) technology. Just as there are a number of factors to consider when developing speech recognition (SR) engines, there are a few issues that must be addressed when creating and implementing rules for TTS engines.

The four common issues that must be addressed when creating a TTS engine are as follows:

Voice quality

Phonemes

TTS synthesis

TTS diphone concatenation

The first two factors deal with the creation of audio tones that are recognizable as human speech. The last two items are competing methods for interpreting text that is to be converted into audio.

Voice Quality

The quality of a computerized voice is directly related to the sophistication of the rules that identify and convert text into an audio signal. It is not too difficult to build a TTS engine that can create recognizable speech. However, it is extremely difficult to create a TTS engine that does not sound like a computer. Three factors in human speech are very difficult to produce with computers:

Prosody

Emotion

Pronunciation anomalies

Human speech has a special rhythm or prosody-a pattern of pauses, inflections, and emphasis that is an integral part of the language. While computers can do a good job of pronouncing individual words, it is difficult to get them to accurately mimic the tonal and rhythmic inflections of human speech. For this reason, it is always quite easy to differentiate computer-generated speech from a computer playing back a recording of a human voice.

Another factor of human speech that computers have difficulty rendering is emotion. While TTS engines are capable of distinguishing declarative statements from questions or exclamations, computers are still not able to convey believable emotive qualities when rendering text into speech.

Lastly, every language has its own pronunciation anomalies. These are words that do not "play by the rules" when it comes to converting text into speech. Some common examples in English are dough and tough, or comb and home. More troublesome are words such as read, which must be understood in context in order to determine their exact pronunciation. For example, the pronunciations differ in "He read the paper" and "She will now read to the class." Even more likely to cause problems is the interjection of technobabble such as "SQL," "MAPI," and "SAPI." All these factors make the development of a truly human-sounding computer-generated voice extremely difficult.

Speech systems usually offer some way to correct for these types of problems. One typical solution is to include the ability to enter the phonetic spelling of a word and relate that spelling to the text version. Another common adjustment is to allow users to enter control tags in the text to instruct the speech engine to add emphasis or inflection, or alter the speed or pitch of the audio output. Much of this type of adjustment information is based on phonemes, as described in the next section.
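The phonetic-spelling fix can be approximated at the application level with a substitution table applied before text reaches the engine. The replacement spellings below are hypothetical examples chosen for illustration, not SAPI features or official pronunciations.

```python
# Hypothetical overrides: spell troublesome tokens the way we
# want them pronounced. Real engines accept phonetic entries or
# inline control tags instead; this is an application-level stand-in.
OVERRIDES = {
    "SQL": "sequel",
    "TTS": "text to speech",
}

def preprocess(text, overrides=OVERRIDES):
    """Rewrite tokens the TTS engine is likely to mangle."""
    return " ".join(overrides.get(w, w) for w in text.split())
```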


Phonemes

As we've discussed, phonemes are the sound parts that make up words. Linguists use phonemes to accurately record the vocal sounds uttered by humans when speaking. These same phonemes also can be used to generate computerized speech. TTS engines use their knowledge of grammar rules and phonemes to scan printed text and generate audio output.



If we are interested in learning more about phonemes and how they are used to analyze speech, we can refer to the Phonetic Symbol Guide by Pullum and Ladusaw (University of Chicago Press, 1996).


The SAPI design model recognizes and allows for the incorporation of phonemes as a method for creating speech output. Microsoft has developed an expression of the International Phonetic Alphabet (IPA) in the form of Unicode strings. Programmers can use these strings to improve the pronunciation skills of the TTS engine, or to add entirely new words to the vocabulary.



If we wish to use phonemes directly to alter the behavior of our TTS engine, we'll have to program using Unicode. SAPI does not support the direct use of phonemes in ANSI format.


As mentioned in the previous section on voice quality, most TTS engines provide several methods for improving the pronunciation of words. Unless we are involved in the development of a text-to-speech engine, we probably will not use phonemes very often.

TTS Synthesis

Once the TTS engine knows what phonemes to use to reproduce a word, there are two possible methods for creating the audio output: synthesis or diphone concatenation.

The synthesis method uses calculations of a person's lip and tongue position, the force of breath, and other factors to synthesize human speech. This method is usually not as accurate as the diphone method. However, if the TTS uses the synthesis method for generating output, it is very easy to modify a few parameters and then create a new "voice."

Synthesis-based TTS engines require less overall computational resources, and less storage capacity. Synthesis-based systems are a bit more difficult to understand at first, but usually offer users the ability to adjust the tone, speed, and inflection of the voice rather easily.

TTS Diphone Concatenation

The diphone concatenation method of generating speech uses pairs of phonemes (di meaning two) to produce each sound. These diphones represent the start and end of each individual speech part. For example, the word pig contains the diphones silence-p, p-i, i-g, and g-silence. Diphone TTS systems scan the word and then piece together the correct phoneme pairs to pronounce the word.
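The decomposition described above is straightforward to express. This sketch expands a phoneme list into the silence-padded diphone pairs a concatenative engine would splice together; it only produces the labels, not the audio.

```python
def diphones(phonemes):
    """Expand a phoneme sequence into diphone pairs, padded with
    silence at each end, as a concatenative TTS engine would.
    """
    units = ["silence"] + list(phonemes) + ["silence"]
    # Pair each unit with its successor: the boundary between two
    # phonemes is where a recorded diphone fragment is spliced in.
    return [f"{a}-{b}" for a, b in zip(units, units[1:])]
```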

These phoneme pairs are produced not by computer synthesis, but from actual recordings of human voices that have been broken down to their smallest elements and categorized into the various diphone pairs. Since TTS systems that use diphones are using elements of actual human speech, they can produce much more human-like output. However, since diphone pairs are very language-specific, diphone TTS systems are usually dedicated to producing a single language. Because of this, diphone systems do not do well in environments where numerous foreign words may be present, or where the TTS might be required to produce output in more than one language.

Grammar Rules

The final elements of a speech engine are the grammar rules. Grammar rules are used by speech recognition (SR) software to analyze human speech input and, in the process, attempt to understand what a person is saying. Most of us suffered through a series of lessons in grade school where our teachers attempted to show us just how grammar rules affect our everyday speech patterns. And most of us probably don't remember a great deal from those lessons, but we all use grammar rules every day without thinking about them, to express ourselves and make sense of what others say to us. Without an understanding of and appreciation for the importance of grammars, computer speech recognition systems would not be possible.

There can be any number of grammars, each composed of a set of rules of speech. Just as humans must learn to share a common grammar in order to be understood, computers must also share a common grammar with the speaker in order to convert audio information into text.

Grammars can be divided into three types, each with its own strengths and weaknesses. The types are:

Context-free grammars

Dictation grammars

Limited domain grammars

Context-free grammars offer the greatest degree of flexibility when interpreting human speech. Dictation grammars offer the greatest degree of accuracy when converting spoken words into printed text. Limited domain grammars offer a compromise between the highly flexible context-free grammar and the restrictive dictation grammar.

The following sections discuss each grammar type in more detail.

Context-Free Grammars

Context-free grammars work on the principle of following established rules to determine the most likely candidates for the next word in a sentence. Context-free grammars do not work on the idea that each word should be understood within a context. Rather, they evaluate the relationship of each word and word phrase to a known set of rules about what words are possible at any given moment.

The main elements of a context-free grammar are:

Words-A list of valid words to be spoken

Rules-A set of speech structures in which words are used

Lists-One or more word sets to be used within rules

Context-free grammars are good for systems that have to deal with a wide variety of input. Context-free systems are also able to handle variable vocabularies. This is because most of the rule-building done for context-free grammars revolves around declaring lists and groups of words that fit into common patterns or rules. Once the SR engine understands the rules, it is very easy to expand the vocabulary by expanding the lists of possible members of a group.

For example, rules in a context-free grammar might look something like this:


<NameRule>=("Fred" | "Mary" | "Elaine")
<SendMailRule>=("Send Email to", <NameRule>)

In the example above, two rules have been established. The first rule, <NameRule>, creates a list of possible names. The second rule, <SendMailRule>, creates a rule that depends on <NameRule>. In this way, context-free grammars allow us to build our own grammatical rules as a predictor of how humans will interact with the system.

Even more importantly, context-free grammars allow for easy expansion at run-time. Since much of the way context-free grammars operate focuses on lists, it is easy to allow users to add list members and, therefore, to improve the value of the SR system quickly. This makes it easy to install a system with only basic components. The basic system can be expanded to meet the needs of various users. In this way, context-free grammars offer a high degree of flexibility with very little development cost or complication.
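A toy recognizer makes the list-plus-rules idea concrete. The grammar below mirrors the send-mail example with an illustrative name list; real SAPI grammars use a different format, and this sketch only supports single-word sub-rule references.

```python
# Illustrative grammar: each rule maps to alternative word patterns,
# and <...> tokens refer to other rules (here, a list of names).
GRAMMAR = {
    "NameRule": [["Fred"], ["Mary"], ["Elaine"]],
    "SendMailRule": [["Send", "Email", "to", "<NameRule>"]],
}

def matches(words, rule, grammar=GRAMMAR):
    """True if `words` can be derived from `rule`."""
    for alt in grammar[rule]:
        if len(alt) != len(words):
            continue
        if all(w == t or (t.startswith("<") and
                          matches([w], t.strip("<>"), grammar))
               for w, t in zip(words, alt)):
            return True
    return False
```

Expanding the vocabulary is just a matter of appending to the NameRule list at run-time, which is exactly the flexibility described above.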

The construction of quality context-free grammars can be a challenge, however. Systems that only need to do a few things (such as load and run programs, execute simple directives, and so on) are easily expressed using context-free grammars. However, in order to perform more complex tasks or a wider range of chores, additional rules are needed. As the number of rules and the length of lists increases, the computational load rises dramatically. Also, since context-free grammars base their predictions on predefined rules, they are not good for tasks like dictation, where a large vocabulary is most important.

Dictation Grammars

Unlike context-free grammars that operate using rules, dictation grammars base their evaluations on vocabulary. The primary function of a dictation grammar is to convert human speech into text as accurately as possible. In order to do this, dictation grammars need not only a rich vocabulary to work from, but also a sample output to use as a model when analyzing speech input. Rules of speech are not important to a system that must simply convert human input into printed text.

The elements of a dictation grammar are:

Topic-Identifies the dictation topic (for example, medical or legal).

Common-A set of words commonly used in the dictation. Usually the common group contains technical or specialized words that are expected to appear during dictation, but are not usually found in regular conversation.

Group-A related set of words that can be expected, but that are not directly related to the dictation topic. The group usually has a set of words that are expected to occur frequently during dictation. The grammar model can contain more than one group.

Sample-A sample of text that shows the writing style of the speaker or general format of the dictation. This text is used to aid the SR engine in analyzing speech input.

The success of a dictation grammar depends on the quality of the vocabulary. The more items on the list, the greater the chance of the SR engine mistaking one item for another. However, the more limited the vocabulary, the greater the number of "unknown" words that will occur during the course of the dictation. The most successful dictation systems balance vocabulary depth and the uniqueness of the words in the database. For this reason, dictation systems are usually tuned for one topic, such as legal or medical dictation. By limiting the vocabulary to the words most likely to occur in the course of dictation, translation accuracy is increased.

Limited Domain Grammars

Limited domain grammars offer a compromise between the flexibility of context-free grammars and the accuracy of dictation grammars. Limited domain grammars have the following elements:

Words-This is the list of specialized words that are likely to occur during a session.

Group-This is a set of related words that could occur during the session. The grammar can contain multiple word groups. A single phrase would be expected to include one of the words in the group.

Sample-A sample of text that shows the writing style of the speaker or general format of the dictation. This text is used to aid the SR engine in analyzing the speech input.

Limited domain grammars are useful in situations where the vocabulary of the system need not be very large. Examples include systems that use natural language to accept command statements, such as "How can I set the margins?" or "Replace all instances of 'New York' with 'Los Angeles.'" Limited domain grammars also work well for filling in forms or for simple text entry.




Copyright 2001 Engineered Station
Last modified: July 30, 2001