Speech Application Programming Interface:
One of the newest extensions for the Windows 95 operating system is the Speech Application Programming Interface (SAPI). This Windows extension gives workstations the ability to recognize human speech as input and to create human-like audio output from printed text. This ability adds a new dimension to human/PC interaction. Speech recognition services can extend the use of PCs to those who find typing too difficult or too time-consuming. Text-to-speech services can provide aural representations of text documents to those who cannot see typical display screens, whether because of physical limitations or the nature of their work.
Like the other Windows services described in this site, SAPI is part of the Windows Open Services Architecture (WOSA) model. Speech recognition (SR) and text-to-speech (TTS) services are actually provided by separate modules called engines. Users can select the speech engine they prefer to use as long as it conforms to the SAPI interface.
In this Section you'll learn the basic concepts behind designing and implementing a speech recognition and text-to-speech engine using the SAPI design model. You'll also learn about creating grammar definitions for speech recognition.
Any speech system has, at its heart, a process for recognizing human speech and turning it into something the computer understands. In effect, the computer needs a translator. Research into effective speech recognition algorithms and processing models has been going on almost since the computer was invented, and a great deal of mathematics and linguistics goes into the design and implementation of a speech recognition system. A detailed discussion of speech recognition algorithms is beyond the scope of our project, but it is important to have a good idea of the commonly used techniques for turning human speech into something a computer understands.
Every speech recognition system uses four key operations to listen to and understand human speech. They are:

- Word separation
- Speaker dependence
- Word matching
- Vocabulary
These four aspects of the speech system are closely interrelated. If we want to develop a speech system with a rich vocabulary, we'll need a sophisticated word matching system to quickly search the vocabulary. Also, as the vocabulary gets larger, more items in the list could sound similar (for example, yes and yet). In order to successfully identify these speech parts, the word separation portion of the system must be able to determine smaller and smaller differences between speech items.
Finally, the speech engine must balance all of these factors against the aspect of speaker dependence. As the speech system learns smaller and smaller differences between words, the system becomes more and more dependent on the speaking habits of a single user. Individual accents and speech patterns can confuse speech engines. In other words, as the system becomes more responsive to a single user, that same system becomes less able to translate the speech of other users.
The next few sections describe each of the four aspects of a speech engine in a bit more detail.
The first task of the speech engine is to accept words as input. Speech engines use a process called word separation to gather human speech. Just as the keyboard is used as an input device to accept physical keystrokes for translation into readable characters, the process of word separation accepts the sound of human speech for translation by the computer.
There are three basic methods of word separation. In ascending order of complexity they are:

- Discrete speech
- Word spotting
- Continuous speech
Systems that use the discrete speech method of word separation require the user to place a short pause between each spoken word. This slight bit of silence allows the speech system to recognize the beginning and ending of each word. The silences separate the words much like the space bar does when we type. The advantage of the discrete speech method is that it requires the least amount of computational resources. The disadvantage of this method is that it is not very user-friendly. Discrete speech systems can easily become confused if a person does not pause between words.
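The idea behind discrete speech can be sketched with a toy segmenter that treats runs of near-silence in a stream of amplitude samples as word boundaries. The threshold and gap values below are arbitrary illustrations, not anything defined by SAPI:

```python
def split_on_silence(samples, threshold=0.1, min_gap=3):
    """Split a sequence of amplitude samples into words, using runs of
    near-silence (below `threshold`) lasting at least `min_gap` samples
    as word boundaries. Shorter quiet stretches stay inside a word."""
    words, current, gap = [], [], 0
    for s in samples:
        if abs(s) < threshold:
            gap += 1
            if gap >= min_gap and current:
                words.append(current)   # long silence: close the word
                current = []
        else:
            gap = 0
            current.append(s)
    if current:
        words.append(current)
    return words
```

A speaker who fails to pause (no run of `min_gap` quiet samples) produces one long "word", which is exactly how discrete speech systems become confused.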
Systems that use word spotting avoid the need for users to pause in between each word by listening only for key words or phrases. Word spotting systems, in effect, ignore the items they do not know or care about and act only on the words they can match in their vocabulary. For example, suppose the speech system can recognize the word help, and knows to load the Windows Help engine whenever it hears the word. Under word spotting, the following phrases will all result in the speech engine invoking Windows Help:

"Help me."
"I could use a little help here."
"What does help mean, anyway?"
As we can see, one of the disadvantages of word spotting is that the system can easily misinterpret the user's meaning. However, word spotting also has several key advantages. Word spotting allows users to speak normally, without employing pauses. Also, since word spotting systems simply ignore words they don't know and act only on key words, these systems can give the appearance of being more sophisticated than they really are. Word spotting requires more computing resources than discrete speech, but not as much as the last method of word separation: continuous speech.
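In spirit, word spotting reduces to scanning the input for known key words and discarding everything else. A minimal sketch, assuming a hypothetical one-entry vocabulary that maps the word help to a help action:

```python
# Hypothetical key-word vocabulary; the action names are illustrative.
ACTIONS = {"help": "launch_help"}

def spot(utterance):
    """Return the actions triggered by any key words in the utterance;
    all other words are simply ignored, as in word spotting."""
    triggered = []
    for raw in utterance.lower().split():
        word = raw.strip(".,?!")          # ignore surrounding punctuation
        if word in ACTIONS:
            triggered.append(ACTIONS[word])
    return triggered
```

Any phrase containing "help" triggers the action, including phrases where the user did not actually want help, which is the misinterpretation risk noted above.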
Continuous speech systems recognize and process every word spoken. This gives the greatest degree of accuracy when attempting to understand a speaker's request. However, it also requires the greatest amount of computing power. First, the speech system must determine the start and end of each word without the use of silence. This is much like readingtextthathasnospacesinit (see!). Once the words have been separated, the system must look them up in the vocabulary and identify them. This, too, can take precious computing time. The primary advantage of continuous speech systems is that they offer the greatest level of sophistication in recognizing human speech. The primary disadvantage is the amount of computing resources they require.
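The boundary problem that continuous speech systems face can be illustrated in text form: recovering words from a string with no spaces, given a known vocabulary. A toy recursive segmenter (the vocabulary and function are illustrative, not part of SAPI):

```python
def segment(text, vocab):
    """Split unspaced text into known words, preferring the longest
    matching prefix and backtracking when a choice leads to a dead end.
    Returns None if no complete segmentation exists."""
    if not text:
        return []
    for end in range(len(text), 0, -1):   # try longest prefix first
        word = text[:end]
        if word in vocab:
            rest = segment(text[end:], vocab)
            if rest is not None:
                return [word] + rest
    return None
```

Even this toy version must try many candidate boundaries and look each one up, which hints at why continuous speech demands so much computing power.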
Speaker dependence is a key factor in the design and implementation of a speech recognition system. In theory, we would like a system that has very little speaker dependence. This would mean that the same workstation could be spoken to by several people with the same positive results. People often speak quite differently from one another, however, and this can cause problems.
First, there is the case of accents. Just using the United States as an example, we can identify several regional sounds. Add to these the possibility that speakers may also have accents that come from outside the U.S. due to the influence of other languages (Spanish, German, Japanese), and we have a wide range of pronunciation for even the simplest of sentences. Speaker speed and pitch inflection can also vary widely, which can pose problems for speech systems that need to determine whether a spoken phrase is a statement or a question.
Speech systems fall into three categories in terms of their speaker dependence. They can be:

- Speaker-independent
- Speaker-dependent
- Speaker-adaptive
Speaker-independent systems require the most resources. They must be able to accurately translate human speech across as many dialects and accents as possible. Speaker-dependent systems require the least amount of computing resources. These systems require that the user "train" the system before it is able to accurately convert human speech. A compromise between the two approaches is the speaker-adaptive method. Speaker-adaptive systems are prepared to work without training, but increase their accuracy after working with the same speaker for a period of time.
The additional training required by speaker-dependent systems can be frustrating to users. Training can take several hours, although some systems reach 90 percent accuracy or better after just five minutes. Users with physical disabilities, or those who find typing highly inefficient, are the most likely to accept speaker-dependent systems.
Systems that will be used by many different people need the power of speaker independence. This is especially true for systems that will have short encounters with many different people, such as greeting kiosks at an airport. In such situations, training is unlikely to occur, and a high degree of accuracy is expected right away.
For systems where multiple people will access the same workstation over a longer period of time, the speaker-adaptive system will work fine. A good example would be a workstation used by several employees to query information from a database. The initial investment spent training the speech system will pay off over time as the same staff uses the system.
Word matching is the process of performing look-ups into the speech database. As each word is gathered (using the word separation techniques described earlier), it must be matched against some item in the speech engine's database. It is the process of word matching that connects the audio input signal to a meaningful item in the speech engine database.
There are two primary methods of word matching:

- Whole-word matching
- Phoneme matching
Under whole-word matching, the speech engine searches the database for a word that matches the audio input. Whole-word matching requires less search capability than phoneme matching. But, whole-word matching requires a greater amount of storage capacity. Under the whole-word matching model, the system must store a word template that represents each possible word that the engine can recognize. While quick retrieval makes whole-word matching attractive, the fact that all words must be known ahead of time limits the application of whole-word matching systems.
Phoneme matching systems keep a dictionary of language phonemes. Phonemes are the smallest unique sound parts of a language, and they can be numerous. For example, while the English language has 26 individual letters, spoken English contains more than 40 distinct phonemes. Also, phonemes are not restricted by spelling conventions.
Consider the words Philip and fill up. These words have the same phonemes: f, eh, ul, ah, and pah. However, they have entirely different meanings. Under the whole-word matching model, these words could represent multiple entries in the database. Under the phoneme matching model, the same five phonemes can be used to represent both words.
As we may expect, phoneme matching systems require more computational resources, but less storage space.
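The trade-off can be sketched as a lookup table keyed by phoneme sequences rather than spellings, so that homophones such as Philip and fill up share a single entry. The phoneme spellings below simply follow the example above; they are not a real phonetic alphabet:

```python
# One phoneme sequence maps to every word or phrase that sounds like it.
PHONEME_DICT = {
    ("f", "eh", "ul", "ah", "pah"): ["Philip", "fill up"],
}

def match_phonemes(phonemes):
    """Return every vocabulary entry that shares this phoneme sequence."""
    return PHONEME_DICT.get(tuple(phonemes), [])
```

Storage shrinks because homophones collapse into one key, but the engine must then do extra work (context, grammar rules) to decide which of the returned candidates the speaker meant.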
The final element of a speech recognition system is the vocabulary. There are two competing issues regarding vocabulary: size and accuracy. As the vocabulary size increases, recognition improves. With large vocabularies, it is easy for speech systems to locate a word that matches the one identified in the word separation phase. However, one of the reasons it is easy to find a match is that more than one entry in the vocabulary may match the given input. For example, the words no and go are very similar to most speech engines. Therefore, as vocabulary size grows, the accuracy of speech recognition can decrease.
Contrary to what we might assume, a speech engine's vocabulary does not represent the total number of words it understands. Instead, the vocabulary of a speech engine represents the number of words that it can recognize in a current state or moment in time. In effect, this is the total number of "unidentified" words that the system can resolve at any moment.
For example, let's assume we have registered the following word phrases with our speech engine: "Start running Exchange" and "Start running Word." Before we say anything, the current state of the speech engine has four words: start, running, Exchange, and Word. Once we say "Start running" there are only two words in the current state: Exchange and Word. The system's ability to keep track of the possible next word is determined by the size of its vocabulary.
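The notion of a current state can be sketched by filtering the registered phrases against the words heard so far; whatever can legally come next is the active vocabulary. The phrases are taken from the example above:

```python
PHRASES = ["Start running Exchange", "Start running Word"]  # registered phrases

def active_vocabulary(heard, phrases=PHRASES):
    """Return the set of words the engine could accept next, given the
    words heard so far -- the engine's 'current state'."""
    heard_words = heard.split()
    possible_next = set()
    for phrase in phrases:
        words = phrase.split()
        # The phrase is still live if it begins with what we have heard
        # and has at least one word remaining.
        if words[:len(heard_words)] == heard_words and len(words) > len(heard_words):
            possible_next.add(words[len(heard_words)])
    return possible_next
```

After "Start running" is heard, only Exchange and Word remain in the current state, which is why resolution gets easier as an utterance proceeds.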
Small vocabulary systems (100 words or less) work well in situations where most of the speech recognition is devoted to processing commands. However, we need a large vocabulary to handle dictation systems. Dictation vocabularies can reach into tens of thousands of words. This is one of the reasons that dictation systems are so difficult to implement. Not only does the vocabulary need to be large, the resolutions must be made quite quickly.
A second type of speech service provides the ability to convert written text into spoken words. This is called text-to-speech (or TTS) technology. Just as there are a number of factors to consider when developing speech recognition engines (SR), there are a few issues that must be addressed when creating and implementing rules for TTS engines.
The four common issues that must be addressed when creating a TTS engine are as follows:

- Voice quality
- Phonemes
- Synthesis
- Diphone concatenation
The first two factors deal with the creation of audio tones that are recognizable as human speech. The last two items are competing methods for interpreting text that is to be converted into audio.
The quality of a computerized voice is directly related to the sophistication of the rules that identify and convert text into an audio signal. It is not too difficult to build a TTS engine that can create recognizable speech. However, it is extremely difficult to create a TTS engine that does not sound like a computer. Three factors in human speech are very difficult to produce with computers:

- Prosody
- Emotion
- Pronunciation anomalies
Human speech has a special rhythm, or prosody: a pattern of pauses, inflections, and emphasis that is an integral part of the language. While computers can do a good job of pronouncing individual words, it is difficult to get them to accurately mimic the tonal and rhythmic inflections of human speech. For this reason, it is always quite easy to differentiate computer-generated speech from a computer playing back a recording of a human voice.
Another factor of human speech that computers have difficulty rendering is emotion. While TTS engines are capable of distinguishing declarative statements from questions or exclamations, computers are still not able to convey believable emotive qualities when rendering text into speech.
Lastly, every language has its own pronunciation anomalies. These are words that do not "play by the rules" when it comes to converting text into speech. Some common examples in English are dough and tough or comb and home. More troublesome are words such as read which must be understood in context in order to figure out their exact pronunciation. For example, the pronunciations are different in "He read the paper" or "She will now read to the class." Even more likely to cause problems is the interjection of technobabble such as "SQL," "MAPI," and "SAPI." All these factors make the development of a truly human-sounding computer-generated voice extremely difficult.
Speech systems usually offer some way to correct for these types of problems. One typical solution is to include the ability to enter the phonetic spelling of a word and relate that spelling to the text version. Another common adjustment is to allow users to enter control tags in the text to instruct the speech engine to add emphasis or inflection, or alter the speed or pitch of the audio output. Much of this type of adjustment information is based on phonemes, as described in the next section.
As we've discussed, phonemes are the sound parts that make up words. Linguists use phonemes to accurately record the vocal sounds uttered by humans when speaking. These same phonemes also can be used to generate computerized speech. TTS engines use their knowledge of grammar rules and phonemes to scan printed text and generate audio output.
The SAPI design model recognizes and allows for the incorporation of phonemes as a method for creating speech output. Microsoft has developed an expression of the International Phonetic Alphabet (IPA) in the form of Unicode strings. Programmers can use these strings to improve the pronunciation skills of the TTS engine, or to add entirely new words to the vocabulary.
As mentioned in the previous section on voice quality, most TTS engines provide several methods for improving the pronunciation of words. Unless we are involved in the development of a text-to-speech engine, we probably will not use phonemes very often.
Once the TTS engine knows which phonemes to use to reproduce a word, there are two possible methods for creating the audio output: synthesis or diphone concatenation.
The synthesis method uses calculations of a person's lip and tongue position, the force of breath, and other factors to synthesize human speech. This method is usually not as accurate as the diphone method. However, if the TTS uses the synthesis method for generating output, it is very easy to modify a few parameters and then create a new "voice."
Synthesis-based TTS engines require less overall computational resources, and less storage capacity. Synthesis-based systems are a bit more difficult to understand at first, but usually offer users the ability to adjust the tone, speed, and inflection of the voice rather easily.
The diphone concatenation method of generating speech uses pairs of phonemes (di meaning two) to produce each sound. These diphones represent the start and end of each individual speech part. For example, the word pig contains the diphones silence-p, p-i, i-g, and g-silence. Diphone TTS systems scan the word and then piece together the correct phoneme pairs to pronounce the word.
These phoneme pairs are produced not by computer synthesis, but from actual recordings of human voices that have been broken down to their smallest elements and categorized into the various diphone pairs. Since TTS systems that use diphones are using elements of actual human speech, they can produce much more human-like output. However, since diphone pairs are very language-specific, diphone TTS systems are usually dedicated to producing a single language. Because of this, diphone systems do not do well in environments where numerous foreign words may be present, or where the TTS might be required to produce output in more than one language.
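Generating the diphone pairs for a word is mechanical once its phonemes are known: pad the phoneme list with silence on both ends and take each adjacent pair. A small sketch using the pig example from above:

```python
def diphones(phonemes):
    """Decompose a word's phoneme list into diphone pairs, including the
    leading and trailing silence transitions
    (e.g. 'pig' -> silence-p, p-i, i-g, g-silence)."""
    padded = ["silence"] + list(phonemes) + ["silence"]
    return [f"{a}-{b}" for a, b in zip(padded, padded[1:])]
```

A diphone TTS engine would then fetch the recorded audio snippet for each pair and splice the snippets together to pronounce the word.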
The final elements of a speech engine are the grammar rules. Grammar rules are used by speech recognition (SR) software to analyze human speech input and attempt to understand what a person is saying. Most of us suffered through a series of grade-school lessons in which our teachers attempted to show us just how grammar rules affect our everyday speech patterns. We probably don't remember a great deal from those lessons, but we all use grammar rules every day, without thinking about them, to express ourselves and to make sense of what others say to us. Without an understanding of grammars, computer speech recognition systems would not be possible.
There can be any number of grammars, each composed of a set of rules of speech. Just as humans must learn to share a common grammar in order to be understood, computers must also share a common grammar with the speaker in order to convert audio information into text.
Grammars can be divided into three types, each with its own strengths and weaknesses. The types are:

- Context-free grammars
- Dictation grammars
- Limited domain grammars
Context-free grammars offer the greatest degree of flexibility when interpreting human speech. Dictation grammars offer the greatest degree of accuracy when converting spoken words into printed text. Limited domain grammars offer a compromise between the highly flexible context-free grammar and the restrictive dictation grammar.
The following sections discuss each grammar type in more detail.
Context-free grammars work on the principle of following established rules to determine the most likely candidates for the next word in a sentence. Context-free grammars do not work on the idea that each word should be understood within a context. Rather, they evaluate the relationship of each word and word phrase to a known set of rules about what words are possible at any given moment.
The main elements of a context-free grammar are:

- Rules, which define the word patterns the engine will accept at any given moment
- Lists and groups of words that can appear within those rules
Context-free grammars are good for systems that have to deal with a wide variety of input. Context-free systems are also able to handle variable vocabularies. This is because most of the rule-building done for context-free grammars revolves around declaring lists and groups of words that fit into common patterns or rules. Once the SR engine understands the rules, it is very easy to expand the vocabulary by expanding the lists of possible members of a group.
For example, rules in a context-free grammar might look something like this:
<NameRule>=("Bob" | "Mary" | "Jane")
<SendMailRule>=("Send Email to", <NameRule>)
In the example above, two rules have been established. The first rule, <NameRule>, creates a list of possible names. The second rule, <SendMailRule>, creates a rule that depends on <NameRule>. In this way, context-free grammars allow us to build our own grammatical rules as a predictor of how humans will interact with the system.
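The rule expansion described above can be sketched as a tiny grammar interpreter: a rule is either a list of alternatives or a sequence that may reference other rules, and adding a name to the list extends what the grammar accepts. The rule syntax, rule names, and names in the list are illustrative, not SAPI's actual grammar format:

```python
# A toy context-free grammar: alternatives are word lists, sequences
# may reference other rules by name.
GRAMMAR = {
    "NameRule": {"alternatives": ["Bob", "Mary", "Jane"]},       # assumed names
    "SendMailRule": {"sequence": ["Send Email to", "NameRule"]},
}

def expand(rule):
    """Yield every complete phrase a rule can produce."""
    spec = GRAMMAR[rule]
    if "alternatives" in spec:
        for alt in spec["alternatives"]:
            yield alt
    else:
        phrases = [""]
        for part in spec["sequence"]:
            # A part is either a literal string or a reference to a rule.
            pieces = list(expand(part)) if part in GRAMMAR else [part]
            phrases = [(p + " " + x).strip() for p in phrases for x in pieces]
        for phrase in phrases:
            yield phrase
```

Appending one name to the alternatives list immediately makes a new "Send Email to ..." phrase valid, which illustrates the run-time expandability discussed below.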
Even more importantly, context-free grammars allow for easy expansion at run-time. Since much of the way context-free grammars operate focuses on lists, it is easy to allow users to add list members and, therefore, to improve the value of the SR system quickly. This makes it easy to install a system with only basic components. The basic system can be expanded to meet the needs of various users. In this way, context-free grammars offer a high degree of flexibility with very little development cost or complication.
The construction of quality context-free grammars can be a challenge, however. Systems that only need to do a few things (such as load and run programs, execute simple directives, and so on) are easily expressed using context-free grammars. However, in order to perform more complex tasks or a wider range of chores, additional rules are needed. As the number of rules and the length of lists increases, the computational load rises dramatically. Also, since context-free grammars base their predictions on predefined rules, they are not good for tasks like dictation, where a large vocabulary is most important.
Unlike context-free grammars that operate using rules, dictation grammars base their evaluations on vocabulary. The primary function of a dictation grammar is to convert human speech into text as accurately as possible. In order to do this, dictation grammars need not only a rich vocabulary to work from, but also a sample output to use as a model when analyzing speech input. Rules of speech are not important to a system that must simply convert human input into printed text.
The elements of a dictation grammar are:

- A rich vocabulary of expected words
- A sample output to use as a model when analyzing speech input
The success of a dictation grammar depends on the quality of the vocabulary. The more items on the list, the greater the chance of the SR engine mistaking one item for another. However, the more limited the vocabulary, the greater the number of "unknown" words that will occur during the course of the dictation. The most successful dictation systems balance vocabulary depth and the uniqueness of the words in the database. For this reason, dictation systems are usually tuned for one topic, such as legal or medical dictation. By limiting the vocabulary to the words most likely to occur in the course of dictation, translation accuracy is increased.
Limited domain grammars offer a compromise between the flexibility of context-free grammars and the accuracy of dictation grammars.
Limited domain grammars are useful in situations where the vocabulary of the system need not be very large. Examples include systems that use natural language to accept command statements, such as "How can I set the margins?" or "Replace all instances of 'New York' with 'Los Angeles.'" Limited domain grammars also work well for filling in forms or for simple text entry.