How Text-to-Speech Works
You might have already used text-to-speech in products, and maybe even incorporated it into your own application, but you still don't know how it works. This document will give you a technical overview of text-to-speech so you can understand how it works, and better understand some of the capabilities and limitations of the technology.
Text-to-speech fundamentally functions as a pipeline that converts text into PCM digital audio. The elements of the pipeline are:

1. Text normalization
2. Homograph disambiguation
3. Word pronunciation
4. Prosody
5. Concatenation of wave segments

I'll cover each of these steps individually.
The "text normalization" component of text-to-speech converts any input text into a series of spoken words. Trivially, text normalization converts a string like "John rode home." to the series of words "john", "rode", "home", along with a marker indicating that a period occurred. However, this gets more complicated with strings like "John rode home at 23.5 mph", where "23.5 mph" is converted to "twenty three point five miles per hour". Here's how text normalization works:
First, text normalization isolates words in the text. For the most part this is as trivial as looking for a sequence of alphabetic characters, allowing for an occasional apostrophe and hyphen.
Text normalization then searches for numbers, times, dates, and other symbolic representations. These are analyzed and converted to words. (Example: "$54.32" is converted to "fifty four dollars and thirty two cents.") Someone needs to code up the rules for the conversion of these symbols into words, since they differ depending upon the language and context.
Next, abbreviations are converted, such as "in." for "inches", and "St." for "street" or "saint". The normalizer will use a database of abbreviations and what they are expanded to. Some of the expansions depend upon the context of surrounding words, like "St. John" and "John St.".
The text normalizer might perform other transformations, such as handling Internet addresses: "http://www.Microsoft.com" is usually spoken as "w w w dot Microsoft dot com".
Whatever remains is punctuation. The normalizer will have rules dictating if the punctuation causes a word to be spoken or if it is silent. (Example: Periods at the end of sentences are not spoken, but a period in an Internet address is spoken as "dot.")
The rules will vary in complexity depending upon the engine. Some text normalizers are even designed to handle E-mail conventions like "You ***WILL*** go to the meeting. :-("
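The normalization steps above can be sketched in a few lines. The toy normalizer below expands currency amounts into words and looks abbreviations up in a small table; the tables, regular expression, and number range are illustrative assumptions, not any real engine's rules:

```python
import re

# Tiny abbreviation table (a real engine's table is far larger and
# context-sensitive, e.g. "St." as "street" vs. "saint").
ABBREVIATIONS = {"mph": "miles per hour", "in.": "inches"}

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

def number_to_words(n: int) -> str:
    """Spell out an integer below 100 (enough for this example)."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + ("" if ones == 0 else " " + ONES[ones])

def expand_currency(match: re.Match) -> str:
    dollars, cents = match.group(1), match.group(2)
    return (number_to_words(int(dollars)) + " dollars and "
            + number_to_words(int(cents)) + " cents")

def normalize(text: str) -> str:
    # Symbolic forms first: "$54.32" -> "fifty four dollars and thirty two cents"
    text = re.sub(r"\$(\d+)\.(\d{2})", expand_currency, text)
    # Then abbreviations, via the lookup table.
    return " ".join(ABBREVIATIONS.get(w.lower(), w) for w in text.split())

print(normalize("The book costs $54.32 at the store."))
# -> The book costs fifty four dollars and thirty two cents at the store.
```

Real normalizers chain many such rule sets, one per symbol class (dates, times, phone numbers, and so on), each hand-coded for the language.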
Once the text has been normalized and simplified into a series of words, it is passed on to the next module, homograph disambiguation.
The next stage of text-to-speech is called "homograph disambiguation." Often it's not a stage by itself, but is combined into the text normalization or pronunciation components. I've separated homograph disambiguation out since it doesn't fit cleanly into either.
In English and many other languages, there are hundreds of words that have the same text, but different pronunciations. A common example in English is "read," which can be pronounced "reed" or "red" depending upon its meaning. A "homograph" is a word with the same text as another word, but with a different pronunciation. The concept extends beyond just words, and into abbreviations and numbers. "Ft." has different pronunciations in "Ft. Wayne" and "100 ft.". Likewise, the digits "1997" might be spoken as "nineteen ninety seven" if the author is talking about the year, or "one thousand nine hundred and ninety seven" if the author is talking about the number of people at a concert.
Text-to-speech engines use a variety of techniques to disambiguate the pronunciations. The most robust is to try to figure out what the text is talking about and decide which meaning is most appropriate given the context. Once the right meaning is known, it's usually easy to guess the right pronunciation.
Text-to-speech engines figure out the meaning of the text, and more specifically of the sentence, by parsing the sentence and figuring out the part-of-speech of each individual word. This is done by guessing the part-of-speech from the word's endings, or by looking the word up in a lexicon. Sometimes a part-of-speech will remain ambiguous until more context is known, as it does for "read." Of course, disambiguation of the part-of-speech may require hand-written rules.
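A heavily simplified sketch of context-based disambiguation for the "read" example: pick a pronunciation from the word just before it. The cue-word lists and phoneme spellings here are assumptions for illustration; a real engine parses the whole sentence rather than peeking at one neighbor.

```python
# Toy homograph disambiguation for "read": guess past vs. present tense
# from the preceding word, then pick the matching pronunciation.
PRONUNCIATIONS = {
    ("read", "present"): "r iy d",   # sounds like "reed"
    ("read", "past"):    "r eh d",   # sounds like "red"
}

PAST_CUES = {"has", "have", "had", "was", "were"}      # "have read"
PRESENT_CUES = {"to", "will", "can", "may", "should"}  # "will read"

def pronounce_read(prev_word: str) -> str:
    prev = prev_word.lower()
    if prev in PAST_CUES:
        return PRONUNCIATIONS[("read", "past")]
    if prev in PRESENT_CUES:
        return PRONUNCIATIONS[("read", "present")]
    return PRONUNCIATIONS[("read", "present")]  # default guess

print(pronounce_read("have"))  # -> r eh d
print(pronounce_read("to"))    # -> r iy d
```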
Once the homographs have been disambiguated, the words are sent to the next stage to be pronounced.
The pronunciation module accepts the text, and outputs a sequence of phonemes, just like you see in a dictionary.
To get the pronunciation of a word, the text-to-speech engine first looks the word up in its own pronunciation lexicon. If the word is not in the lexicon then the engine reverts to "letter to sound" rules.
Letter-to-sound rules guess the pronunciation of a word from its text. They're kind of the inverse of the spelling rules you were taught in school. There are a number of techniques for guessing the pronunciation, but the algorithm described here is one of the more easily implemented ones.
The letter-to-sound rules are "trained" on a lexicon of hand-entered pronunciations. The lexicon stores each word and its pronunciation, such as:

hello = h eh l oe
An algorithm is used to segment the word and figure out which letter "produces" which sound. You can clearly see that the "h" in "hello" produces the "h" phoneme, the "e" produces the "eh" phoneme, the first "l" produces the "l" phoneme, the second "l" produces nothing, and the "o" produces the "oe" phoneme. Of course, in other words the individual letters produce different phonemes: the "e" in "he" will produce the "ee" phoneme.
Once the words are segmented by phoneme, another algorithm determines which letter or sequence of letters is likely to produce which phonemes. The first pass figures out the most likely phoneme generated by each letter. "H" almost always generates the "h" sound, while "o" almost always generates the "ow" sound. A secondary list is generated, showing exceptions to the previous rule given the context of the surrounding letters. Hence, an exception rule might specify that an "o" occurring at the end of the word and preceded by an "l" produces an "oe" sound. The list of exceptions can be extended to include even more surrounding characters.
When the letter-to-sound rules are asked to produce the pronunciation of a word, they do the inverse of the training model. To pronounce "hello", the letter-to-sound rules first try to figure out the sound of the letter "h". They look through the exception table for an "h" beginning a word and followed by "e"; since they can't find one, they use the default sound for "h", which is "h". Next, they look in the exceptions for how an "e" surrounded by "h" and "l" is pronounced, finding "eh". The rest of the characters are handled in the same way.
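The lookup just described might be sketched as a default table plus an exception table keyed on the surrounding letters. The tables below are hand-written to reproduce the "hello" example; real tables are trained from a large lexicon.

```python
# Default phoneme per letter (only the letters needed for this example).
DEFAULTS = {"h": "h", "e": "ee", "l": "l", "o": "ow"}

# Exceptions: (previous letter, letter, next letter) -> phoneme.
# "#" marks a word boundary; "" means the letter is silent.
EXCEPTIONS = {
    ("h", "e", "l"): "eh",   # "e" between "h" and "l" -> "eh"
    ("l", "l", "o"): "",     # second "l" in "llo" is silent
    ("l", "o", "#"): "oe",   # word-final "o" after "l" -> "oe"
}

def letter_to_sound(word: str) -> list:
    padded = "#" + word + "#"
    phonemes = []
    for i, letter in enumerate(word):
        context = (padded[i], letter, padded[i + 2])
        # Try the exception table first, then fall back to the default.
        phoneme = EXCEPTIONS.get(context, DEFAULTS.get(letter, ""))
        if phoneme:
            phonemes.append(phoneme)
    return phonemes

print(letter_to_sound("hello"))  # -> ['h', 'eh', 'l', 'oe']
print(letter_to_sound("he"))     # -> ['h', 'ee']
```

Extending the exception key to more surrounding characters, as the text describes, just means trying the longest-context table first and falling back toward the defaults.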
This technique can pronounce any word, even if it wasn't in the training set, and makes a very reasonable guess at the pronunciation, sometimes better than humans do. It doesn't work too well for names, because most names are not of English origin and use different pronunciation rules. (Example: "Mejia" is pronounced as "meh-jee-uh" by anyone who doesn't know it is Spanish.) Some letter-to-sound rules first guess what language the word came from, and then use a different set of rules for each language.
Word pronunciation is further complicated by people's laziness. People will change the pronunciation of a word based upon what words precede or follow it, just to make the word easier to speak. An obvious example is the way "the" can be pronounced as "thee" or "thuh." Other effects include the dropping or changing of phonemes: a commonly used phrase such as "What are you doing?" sounds like "Whacha doin?"
Once the pronunciations have been generated, they are passed on to the prosody stage.
Prosody is the pitch, speed, and volume with which syllables, words, phrases, and sentences are spoken. Without prosody, text-to-speech sounds very robotic; with bad prosody, it sounds like it's drunk.
The technique that engines use to synthesize prosody varies, but there are some general techniques.
First, the engine identifies the beginning and ending of sentences. In English, the pitch will tend to fall near the end of a statement, and rise for a question. Likewise, volume and speaking speed ramp up when the text-to-speech first starts talking, and fall off on the last word when it stops. Pauses are placed between sentences.
Engines also identify phrase boundaries, such as noun phrases and verb phrases. These will have similar characteristics to sentences, but will be less pronounced. The engine can determine the phrase boundaries by using the part-of-speech information generated during the homograph disambiguation. Pauses are placed between phrases or where commas occur.
Algorithms then try to determine which words in the sentence are important to the meaning, and these are emphasized. Emphasized words are louder, longer, and will have more pitch variation. Words that are unimportant, such as those used to make the sentence grammatically correct, are de-emphasized. In a sentence such as "John and Bill walked to the store," the emphasis pattern might be "JOHN and BILL walked to the STORE." The more the text-to-speech engine "understands" what's being spoken, the better its emphasis will be.
Next, the prosody within a word is determined. Usually the pitch and volume rise on stressed syllables.
All of the pitch, timing, and volume information from the sentence level, phrase level, and word level is combined to produce the final output. The output of the prosody module is just a list of phonemes with the pitch, duration, and volume for each phoneme.
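As a rough sketch of how the levels combine, the toy function below overlays a falling sentence-level pitch contour with a per-syllable stress bump, producing the kind of (phoneme, pitch, duration, volume) list the prosody module hands to the synthesizer. All contour shapes and numbers are illustrative assumptions.

```python
# Toy prosody: sentence-level falling pitch plus emphasis on stressed
# positions. Output: (phoneme, pitch in Hz, duration in ms, volume).
def apply_prosody(phonemes, stressed):
    """phonemes: list of phoneme strings; stressed: set of indices to emphasize."""
    n = len(phonemes)
    output = []
    for i, ph in enumerate(phonemes):
        # Sentence level: pitch falls linearly toward the end of a statement.
        pitch = 180.0 - 60.0 * (i / max(n - 1, 1))
        duration = 90.0
        volume = 1.0
        if i in stressed:
            # Emphasized syllables are higher, longer, and louder.
            pitch += 25.0
            duration *= 1.3
            volume *= 1.2
        output.append((ph, round(pitch, 1), round(duration, 1), round(volume, 2)))
    return output

for entry in apply_prosody(["jh", "aa", "n"], stressed={1}):
    print(entry)
```

A real engine layers several more contours (phrase boundaries, question rises, pauses) on top before emitting the final list.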
The speech synthesis is almost done by this point. All the text-to-speech engine has to do is convert the list of phonemes and their duration, pitch, and volume, into digital audio.
Methods for generating the digital audio will vary, but many text-to-speech engines generate the audio by concatenating short recordings of phonemes. The recordings come from a real person. In a simplistic form, the engine receives the phoneme to speak, loads the digital audio from a database, does some pitch, time, and volume changes, and sends it out to the sound card.
It isn't quite that simple, for a number of reasons.
Most noticeable is that one recording of a phoneme won't end with the same volume, pitch, and sound quality that the next recording begins with. This causes a noticeable glitch in the audio. An engine can reduce the glitch by blending the edges of the two segments together so that at their intersection they have the same pitch and volume. Blending the sound quality, which is determined by the harmonics generated by the voice, is more difficult, and can be addressed by the next step.
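In miniature, the blending step might look like the following: a linear crossfade over the seam between two recordings, represented here as plain lists of samples. Real engines also align pitch periods before blending, which this sketch ignores.

```python
# Join two phoneme recordings with a linear crossfade so volume
# discontinuities at the seam are smoothed instead of clicking.
def crossfade_concat(a, b, overlap):
    """Blend the last `overlap` samples of `a` into the first `overlap` of `b`."""
    out = list(a[:-overlap])
    for i in range(overlap):
        t = i / (overlap - 1) if overlap > 1 else 1.0  # fades 0 -> 1 across the seam
        out.append(a[len(a) - overlap + i] * (1.0 - t) + b[i] * t)
    out.extend(b[overlap:])
    return out

seg1 = [0.5] * 6       # tail of one phoneme recording
seg2 = [-0.5] * 6      # head of the next
joined = crossfade_concat(seg1, seg2, overlap=4)
print(len(joined))     # -> 8 samples: the 4-sample overlap is shared
```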
The sound that a person makes when he or she speaks a phoneme changes depending upon the surrounding phonemes. If you record "cat" in a sound recorder and then reverse it, the reversed audio doesn't sound like "tak", which has the reversed phonemes of "cat". Rather than using one recording per phoneme (about 50), the text-to-speech engine maintains thousands of recordings (usually 1000-5000). Ideally it would have all possible phoneme context combinations recorded, 50 * 50 * 50 = 125,000, but this would be too many. Since many of these combinations sound similar, one recording is used to represent the phoneme within several different contexts.
Even a database of 1000 phoneme recordings is too large, so the digital audio is compressed into a much smaller size, usually between 8:1 and 32:1 compression. The more compressed the digital audio, the more muted the voice sounds.
Once the digital audio segments have been concatenated, they're sent off to the sound card, making the computer talk.
Generating a Voice
You might be wondering, "How do you get thousands of recordings of phonemes?"
The first step is to select a voice talent. The voice talent then spends several hours in a recording studio reading a wide variety of text. The text is designed so that as many phoneme sequence combinations as possible are recorded. At the very least, you want the talent to read enough text that there are several occurrences of each of the 1000 to 5000 recording slots.
After the recording session is finished, the recordings are sent to a speech recognizer, which determines where the phonemes begin and end. Since the tool also knows the surrounding phonemes, it's easy to pull out the right recordings from the speech. The only trick is to figure out which recording sounds best. Usually an algorithm makes a guess, but someone must listen to the phoneme recordings just to make sure they're good.
The selected phoneme recordings are compressed and stored away in the database. The result is a new voice.
This was a high-level overview of how text-to-speech works. Most text-to-speech engines work in a similar manner, although not all of them work this way. The overview doesn't give you enough detail to write your own text-to-speech engine, but now you know the basics. If you want more detail, you should purchase one of the numerous technical books on text-to-speech.