How Speech Recognition Works
You might have already used speech recognition in products, and maybe even incorporated it into your own application, but you still dont know how it works. This document will give you a technical overview of speech recognition so you can understand how it works, and better understand some of the capabilities and limitations of the technology.
Speech recognition fundamentally functions as a pipeline that converts PCM (Pulse Code Modulation) digital audio from a sound card into recognized speech. The elements of the pipeline are:
Ill cover each of these steps individually
Transform the PCM digital audio
The first element of the pipeline converts digital audio coming from the sound card into a format thats more representative of what a person hears. The digital audio is a stream of amplitudes, sampled at about 16,000 times per second. If you visualize the incoming data, it looks just like the output of an oscilloscope. Its a wavy line that periodically repeats while the user is speaking. While in this form, the data isnt useful to speech recognition because its too difficult to identify any patterns that correlate to what was actually said.
To make pattern recognition easier, the PCM digital audio is transformed into the "frequency domain." Transformations are done using a windowed fast-Fourier transform. The output is similar to what a spectrograph produces. In frequency domain, you can identify the frequency components of a sound. From the frequency components, its possible to approximate how the human ear perceives the sound.
The fast Fourier transform analyzes every 1/100th of a second and converts the audio data into the frequency domain. Each 1/100th of a second results is a graph of the amplitudes of frequency components, describing the sound heard for that 1/100th of a second. The speech recognizer has a database of several thousand such graphs (called a codebook) that identify different types of sounds the human voice can make. The sound is "identified" by matching it to its closest entry in the codebook, producing a number that describes the sound. This number is called the "feature number." (Actually, there are several feature numbers generated for every 1/100 the of a second but the process is easier to explain assuming only one.)
The input to the speech recognizer began as a stream of 16,000 PCM values per second. By using fast Fourier transforms and the codebook, it is boiled down into essential information, producing 100 feature numbers per second.
Figure Out Which Phonemes Are Spoken
Im going to jump out of order here. To make the recognition process easier to understand, Ill first explain how the recognizer determines what phonemes were spoken and then explain the grammars.
In an ideal world, you could match each feature number to a phoneme. If a segment of audio resulted in feature #52, it could always mean that the user made an "h" sound. Feature #53 might be an "f" sound, etc. If this were true, it would be easy to figure out what phonemes the user spoke.
Unfortunately, this doesnt work because of a number of reasons:
The background noise and variability problems are solved by allowing a feature number to be used by more than just one phoneme, and using statistical models to figure out which phoneme is spoken. This can be done because a phoneme lasts for a relatively long time, 50 to 100 feature numbers, and its likely that one or more sounds are predominant during that time. Hence, its possible to predict what phoneme was spoken.
Actually, the approximation is a bit more complex than this. Ill explain by starting at the origin of the process. For the speech recognition to learn how a phoneme sounds, a training tool is passed hundreds of recordings of the phoneme. It analyzes each 1/100 th of a second of these hundreds of recordings and produces a feature number. From these it learns statistics about how likely it is for a particular feature number to appear in a specific phoneme. Hence, for the phoneme "h", there might be a 55% chance of feature #52 appearing in any 1/100 th of a second, 30% chance of feature #189 showing up, and 15% chance of feature #53. Every 1/100 th of a second of an "f" sound might have a 10% chance of feature #52, 10% chance of feature #189, and 80% chance of feature #53.
The probability analysis done during training is used during recognition. The 6 feature numbers that are heard during recognition might be:
The recognizer computes the probability of the sound being an "h" and the probability of it being any other phoneme, such as "f". The probability of "h" is:
The probability of the sound being an "f" is:
You can see that given the current data, "h" is a more likely candidate. (For those of you that are mathematical sticklers, youll notice that the "probabilities" are no longer probabilities because they dont sum to one. From now on Ill call them "scores" since theyre un-normalized probabilities.)
The speech recognizer needs to know when one phoneme ends and the next begins. Speech recognition engines use a mathematical technique called "Hidden Markov Models" (HMMs) that figure this out. This article wont get into the mathematics of how HMMs work, but here's an intuitive feel. Lets assume that the recognizer heard a word with an "h" phoneme followed by an "ee" phoneme. The "ee" phoneme has a 75% chance of producing feature #82 every 1/100 th of a second, 15% of chance feature #98, and a 10% chance of feature #52. Notice that feature #52 also appears in "h". If you saw a lineup of the data, it might look like this:
So where does the "h" end and the "ee" begin? From looking at the features you can see that the 52s are grouped at the beginning, and the 82s grouped at the end. The split occurs someplace in-between. Humans can eye-ball this. Computers use Hidden Markov Models.
By the way, the speech recognizer figures out when speech starts and stops because it has a "silence" phoneme, and each feature number has a probability of appearing in silence, just like any other phoneme.
Now our recognizer can recognize what phoneme was spoken if theres background noise or the users voice had some variation. However, theres another problem. The sound of phonemes changes depending upon what phoneme came before and after. You can hear this with words such as "he" and "how". You dont speak a "h" followed by an "ee" or "ow", but the vowels intrude into the "h", so the "h" in "he" has a bit of "ee" in it, and the "h" in "how" as a bit of "ow" in it.
Speech recognition engines solve the problem by creating "tri-phones", which are phonemes in the context of surrounding phonemes. Thus, theres a tri-phone for "silence-h-ee" and one for "silence-h-ow". Since there are roughly 50 phonemes in English, you can calculate that there are 50*50*50 = 125,000 tri-phones. Thats just too many for current PCs to deal with so similar sounding tri-phones are grouped together.
And now for our last issue. The sound of a phoneme is not constant. A "t" sound is silent at first, then produces a sudden burst high frequency of noise, which then fades to silence. Speech recognizers solve this by splitting each phoneme into several segments and generating a different senone for each segment. The recognizer figures out where each segment begins and ends in the same way it figures out where a phoneme begins and ends.
After all this work, the speech recognizer has all the mechanics necessary to recognize if a particular phoneme was spoken. An important question still needs answering: How does it determine which phoneme to look for?
A speech recognizer works by hypothesizing a number of different "states" at once. Each state contains a phoneme with a history of previous phonemes. The hypothesized state with the highest score is used as the final recognition result.
When the speech recognizer starts listening it has one hypothesized state. It assumes the user isnt speaking and that the recognizer is hearing the "silence" phoneme. Every 1/100 th of a second it hypothesizes that the user has started speaking, and adds a new state per phoneme, creating 50 new states, each with a score associated with it. After the first 1/100 th of a second the recognizer has 51 hypothesized states.
In 1/100 th of a second, another feature number comes in. The scores of the existing states are recalculated with the new feature. Then, each phoneme has a chance of transitioning to yet another phoneme, so 51 * 50 = 2550 new states are created. The score of each state is the score of the first 1/100 th of a second times the score if the 2 nd 1/100 th of a second. After 2/100 ths of a second the recognizer has 2601 hypothesized states.
This same process is repeated every 1/100 th of a second. The score of each new hypothesis is the score of its parent hypothesis times the score derived from the new 1/100 th of a second. In the end, the hypothesis with the best score is whats used as the recognition result.
Of course, a few optimizations are introduced.
If the score of a hypothesis is much lower than the highest score then the hypothesis is dropped. This is called pruning. The optimization is intuitively obvious. If the recognizer is millions of times more confident that it heard "h eh l oe" than "z z z z," then theres not much point in continuing the hypothesis that the recognizer heard, "z z z z". However, if too much is pruned then errors can be introduced since the recognizer might make a mistake about which phoneme was spoken.
Recognizers also optimize by not hypothesizing a transition to a new phoneme ever 1/100 th of a second. To do this though, the recognizer must limit what phonemes can follow other phonemes.
Reducing Computation and Increasing Accuracy
The speech recognizer can now identify what phonemes were spoken. Figuring out what words were spoken should be an easy task. If the user spoke the phonemes, "h eh l oe", then you know they spoke "hello". The recognizer should only have to do a comparison of all the phonemes against a lexicon of pronunciations.
Its not that simple.
To reduce computation and increase accuracy, the recognizer restricts acceptable inputs from the user. On the whole, this isnt a bad assumption because:
Context Free Grammar
One of the techniques to reduce the computation and increase accuracy is called a "Context Free Grammar" (CFG). CFGs work by limiting the vocabulary and syntax structure of speech recognition to only those words and sentences that are applicable to the applications current state.
The application specifies the vocabulary and syntax structure in a text file. The text file might look like this:
This grammar translates into seven possible sentences:
Of course, the grammars can be much more complex than this. The important feature about the CFG is that it limits what the recognizer expects to hear to a small vocabulary and tight syntax. After the user speaks "Send" the recognizer will only listen to "mail". This significantly reduces the number of generated hypothesis.
The speech recognition gets the phonemes for each word by looking the word up in a lexicon. If the word isnt in the lexicon then it predicts the pronunciation; See the "How Text-to-Speech Works" document for an explanation of pronunciation prediction. Some words have more than one pronunciation, such as "read" which can be pronounced like "reed" or "red". The recognizer basically treats one word with multiple pronunciations the same as two "words".
CFGs slightly change the hypothesis portion of speech recognition. Rather than hypothesizing the transition to all phonemes, the recognizer merely hypothesizes the transition to the next acceptable phonemes. From the initial "silence" phoneme the recognizer hypothesizes the "s" in send, "k" in "call", and "eh" in exit. If the recognizer hypothesizes phoneme transitions from the "s" phoneme, it will only hypothesis "eh", followed by "n", "d", "m", "ae", "l", etc.
You can see how this significantly reduces the computation. Instead of increasing the number of hypotheses by a factor of 50 each time, the number of hypotheses stay constant within a word, and only increase a little bit on word transitions. Given a normal amount of pruning, there are no more than about 10 hypotheses around at a time.
When the user has finished speaking, the recognizer returns the hypothesis with the highest score, and the words that the user spoke are returned to the application.
Speech recognition using a CFG requires a 486/33 to run real-time.
Context Free Grammars dont allow users to speak arbitrary text. If a user wishes to dictate text into a document the speech recognizer must work differently and consume a bit more CPU. Ill explain how a "discrete dictation" speech recognition engine works internally.
With discrete dictation, a user can speak any word he/she wishes out of a vocabulary, usually 60,000 words. However, the user must leave quarter second pauses between words, making it easy for the recognizer to tell where words begin and end. Discrete dictation systems require a Pentium 60. If the user doesnt leave pauses its called "continuous dictation." Ill cover that next.
The simple approach to discrete dictation is to create a Context Free Grammar thats just a list of words. This will work, but it will have a few problems:
In the end, using just a CFG will produce less than 80% word accuracy.
The speed and accuracy can be improved by knowing the grammar of the language, and how likely word combinations are. Theres little point in hypothesizing phrases such as "New York Aardvark".
To learn how words adjoin one-another, the recognizer adds another step to producing the engines recognition database. In addition to using recordings of the phonemes, the training tools analyze a huge amount of raw text data, ideally 1,000,000,000 words or so, about 8 gigabytes of text. The text is segmented into words. For example, the sentence:
Is converted to:
Sequences of three words are then counted, producing a file that essentially indicates how common any sequence of three words is:
The real database will be generated from over a billion words of text, so it will be much more complete than what you see above. A short segment of the real table might look like this:
The term "New York City" appeared 553 times, while "New York Sid" only occurred once. From this the recognizer know that if the user has spoken "New York" that its much more likely theyre saying "city" than "Sid".
Now, lets expand on the simple approach of the Context Free Grammar.
The statistics can be converted to probabilities, and these probabilities can be incorporated into each of the hypotheses generated during recognition. The score of a hypothesis is equal to the acoustic score of the hypothesis times the probability that the word will appear in the given context. This is called a language model.
Example: The score of the phrase "New York City" is the acoustic score of "New York City" times 483, while the score of the phrase "New York Sid" is the acoustic score of "New York Sid" times 1. Another way of putting this is that the recognizer will only claim to have heard "Sid" if it is much more confident that its heard the word "Sid" (following "New York") than "City".
Sometimes a word combination was never seen in the original database. Perhaps none of the text ever mentions "New York Aardvark," but the user might still speak it. The recognizer always leaves a small but finite chance that any word will follow any other word. It just has to be very sure that it heard the acoustics of the word correctly.
The language model will reduce the error rate by a factor of four, taking accuracy from 80% to 95%. Heres why:
Recognition is sped up because unlikely word combinations are quickly discarded.
Dictation applications frequently allow the user to correct mistakes by showing the top-10 alternatives to the user and allowing him/her to click on one. The list is constructed from the top-10 losing hypotheses for the word.
Continuous dictation allows the user to speak anything he/she wants out of a large vocabulary. This is more difficult than discrete dictation because the speech recognition engine doesnt easily know where one word ends and the next begins. For example: Speak out loud "recognize speech" and "wreck a nice beach" quickly; They both sound similar.
Continuous dictation works similar to discrete dictation except the end of a word is not detected by silence. Rather, when a hypothesis reaches the end of a word in continuous dictation, it then produces thousands of new hypotheses and prunes those out. The language model probability helps to prune the hypothesis down a lot more in continuous dictation.
Recognizers use a lot more optimizations to optimize processing and memory in continuous dictation systems. The article wont cover those here because their description doesnt help explain the underlying technology.
Speech recognition system "adapt" to the users voice, vocabulary, and speaking style to improve accuracy. A system that has had time enough to adapt to an individual can have one fourth the error rate of a speaker independent system. Adaptation works because the speech recognition is often informed (directly or indirectly) by the user if its recognition was correct, and if not, what the correct recognition is.
The recognizer can adapt to the speakers voice and variations of phoneme pronunciations in a number of ways. First, it can gradually adapt the codebook vectors used to calculate the acoustic feature number. Second, it can adapt the probability that a feature number will appear in a phoneme. Both of these are done by weighted averaging.
The language model can also be adapted in a number of ways. The recognizer can learn new words, and slowly increase probabilities of word sequences so that commonly used word sequences are expected. Both these techniques are useful for learning names.
Although not common, the speech recognizer can adapt word pronunciations in its lexicon. Each word in a lexicon typically has one pronunciation. The word "orange" might be pronounced like "or-anj". However, users will sometimes speak "ornj" or "or-enj". The recognizer can algorithmically generate hypothetical alternative pronunciations for a word. It then listens for all of these pronunciations during standard recognition, "or-anj", "or-enj", "or-inj", and "ornj". During the process of recognition one of these pronunciations will be heard, although theres a fair chance that the recognizer heard a different pronunciation than what the user spoke. However, after the user has spoken the word a number of times the recognizer will have enough examples that it can determine what pronunciation the user spoke.
This was a high level overview of how speech recognition works. Its not nearly enough detail to actually write a speech recognizer, but it exposes the basic concepts. Most speech recognition engines work in a similar manner, although not all of them work this way. If you want more detail you should purchase one of the numerous technical books on speech recognition.
Send mail to email@example.com with questions or comments about this web site.