Phonemes are the fundamental particles of meaningful sound in spoken language. Each is, by definition, different from the rest, but none is unchanging in itself: whenever a phoneme is voiced, it is identifiably like its previous occurrences yet not exactly the same. A phoneme's sound, and its neighboring phonemes, are forever varying. The question for computer speech is how to model this variation mathematically. A start can be made by dividing each phoneme into three audible states: an introductory state, on the way in from the previous phoneme; a middle state; and an exiting state, on the way out to the next phoneme. This chain of states and the transitions between them is reminiscent of the Markov chain, and still more of its offspring, the hidden Markov model.

A Markov chain consists of a number of states linked by a number of possible transitions. Associated with each transition is a probability, and associated with each state is an output symbol. Take a three-state Markov chain with transition probabilities aij between states i and j, where the symbol A is associated with state 1, the symbol B with state 2, and the symbol C with state 3. Whenever a state is entered, it produces its associated symbol as output; entering state 2, for example, produces B. These symbols are called output symbols because a Markov chain is thought of as a generative model: it outputs a symbol every time a state is entered. Note that, in a Markov chain, the shift (transition) from one state to another is probabilistic, but the production of the output symbol is deterministic. Given a sequence of output symbols generated by a Markov chain, it is therefore possible to infer the corresponding sequence of states completely and unambiguously, provided the output symbol for each state is unique. For example, the symbol sequence B A A C B B A C C C A is produced by transitioning into the state sequence 2 1 1 3 2 2 1 3 3 3 1.

Hidden Markov Models

A hidden Markov model (HMM) is the same as a Markov chain, except that the output symbols, as well as the transitions, are probabilistic. No longer is there one output symbol per state; instead, all symbols are possible at each state, and each symbol has its own probability of occurring. Thus, associated with each state is a probability distribution over all the output symbols, known as the output probabilities, and there can be any number of output symbols. (If the discrete set of output symbols is replaced by a continuously valued vector, then a probability density function can be defined over all possible values of the random output vector, as will be seen below.) The next figure illustrates a three-state HMM. It has the same transition probabilities as the Markov chain shown first; what is different is that a probability distribution bi(s) is associated with each state i, defined over the set of output symbols s (in this case, five output symbols: A, B, C, D, and E). Every output symbol is now chosen according to the probability distribution of the state being entered. Compared with a Markov chain, the output sequences generated by an HMM are doubly stochastic: not only is the transition from one state to another stochastic (probabilistic), but so is the output symbol generated at each state. As a result, a sequence of symbols output by an HMM cannot be traced back unambiguously to a sequence of states.
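The contrast between the two generative models can be made concrete with a short sketch. The Python code below is illustrative only and not from the original article: the transition probabilities and the HMM output distributions are assumed values chosen for the example.

```python
import random

# --- Markov chain: probabilistic transitions, deterministic outputs --------
# States 1, 2, 3 emit the fixed symbols A, B, C, as in the text. The
# transition probabilities trans[i][j] are assumed values for illustration.
symbol_of_state = {1: "A", 2: "B", 3: "C"}
trans = {                      # trans[i][j] = P(next state = j | current state = i)
    1: {1: 0.2, 2: 0.5, 3: 0.3},
    2: {1: 0.4, 2: 0.3, 3: 0.3},
    3: {1: 0.3, 2: 0.3, 3: 0.4},
}

def generate_markov_chain(start, n):
    """Walk the chain for n steps; every state entered emits its symbol."""
    state, output = start, []
    for _ in range(n):
        output.append(symbol_of_state[state])
        state = random.choices(list(trans[state]), weights=list(trans[state].values()))[0]
    return output

# Because each state has a unique symbol, the state sequence is recovered
# exactly from the symbol sequence:
state_of_symbol = {s: q for q, s in symbol_of_state.items()}
print([state_of_symbol[s] for s in "BAACBBACCCA"])   # -> [2, 1, 1, 3, 2, 2, 1, 3, 3, 3, 1]

# --- HMM: transitions AND outputs are probabilistic ------------------------
# Each state now has its own distribution b_i(s) over all five symbols
# (again, assumed numbers), so no symbol identifies a unique state.
emit = {
    1: {"A": 0.6, "B": 0.1, "C": 0.1, "D": 0.1, "E": 0.1},
    2: {"A": 0.1, "B": 0.6, "C": 0.1, "D": 0.1, "E": 0.1},
    3: {"A": 0.1, "B": 0.1, "C": 0.6, "D": 0.1, "E": 0.1},
}

def generate_hmm(start, n):
    """Every state entered draws a symbol from its own output distribution,
    so many different state paths can produce the same symbol sequence."""
    state, output = start, []
    for _ in range(n):
        output.append(random.choices(list(emit[state]), weights=list(emit[state].values()))[0])
        state = random.choices(list(trans[state]), weights=list(trans[state].values()))[0]
    return output

print(generate_hmm(1, 10))   # e.g. ['A', 'B', 'B', 'C', ...]: the state path stays hidden
```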
Any and every sequence of states that has the same length as the symbol sequence is possible, each with a different probability. The sequence of states is said to be hidden from the observer, who sees only the output symbol sequence, and that is why these models are called hidden Markov models. Even though absolute certainty is out of the question, one might still ask which sequence of states is most likely to have generated a given sequence of symbols. This, in effect, is the speech recognition problem. Finding the answer requires a search procedure that, in principle, has to look at all possible state sequences and compute their probabilities. This is a huge task, but there is a way out. The Markov nature of an HMM (namely, that the probability of being in a state depends only on the previous state) admits use of the Viterbi algorithm, which finds the state sequence most likely to have generated the given sequence of symbols without having to search all possible sequences.

Phonetic HMMs

The figure shows a three-state hidden Markov model for a single phoneme. Besides the transition probabilities among the states, each state has an associated probability density defined over x, where x is the feature vector computed from the sound spectrogram for each frame, about every 10 ms. Gone is the set of discrete output symbols; in its place is an infinite, continuous set of symbols comprising the uncountably many possible values of the feature vector x. Usually there is one HMM per phonetic context of interest. Usually, too, all such models have the same structure and are distinguished from each other only by their transition and output probabilities. (A big advance in speech recognition technology has been the development of better methods for modeling the HMM output probability densities in high-dimensional vector spaces. The most popular method models each density as a sum of high-dimensional Gaussian densities.)

Not all transitions are allowed in this model; put another way, the transitions that do not appear have zero probability. In speech, time flows forward only, so transitions are allowed from left to right but not from right to left. Transitions from a state back to itself serve to model temporal variability, a necessity for speech, since different instantiations of phonemes and words are uttered with different time registration. The transition from state 1 to state 3 indicates that the shortest phoneme that can be modeled is two frames, or 20 ms, long; such a phoneme would occupy state 1 for one frame and state 3 for one frame only. One explanation for the need for three states, in general, is that they correspond roughly to the left, middle, and right parts of the phoneme. (If more states were used, more training data would be needed to estimate their parameters reliably.)

Shown at the bottom of the figure is a sequence of feature vectors and their correspondence with the states of the HMM. As state 1 of phoneme k is entered from phoneme k-1, one of the feature vectors is generated from state 1's probability distribution. Then the transition probabilities out of state 1 dictate whether a shift is made back to the same state, to state 2, or to state 3; whichever state is picked, its probability distribution is the basis for generating another feature vector, and so on, until a transition out of state 3 is made to phoneme k+1.
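As a concrete illustration, here is a minimal Viterbi sketch over a three-state, left-to-right phoneme model of the kind just described. Every number in it is an assumption made for the example: the transition matrix allows self-loops, a step to the next state, and the 1-to-3 skip, and each state's output density b_i(x) is a single isotropic Gaussian over a two-dimensional feature vector, standing in for the mixtures of high-dimensional Gaussians used in real recognizers. States are 0-indexed in the code, so state 1 of the text is index 0 here.

```python
import math
import numpy as np

N_STATES = 3

# Assumed left-to-right transition probabilities: row i gives P(next = j | i).
# Self-loops model temporal variability; the [0, 2] entry is the 1-to-3 skip;
# right-to-left entries are zero (floored slightly so that log() is defined).
trans = np.array([
    [0.2, 0.5, 0.3],
    [0.0, 0.6, 0.4],
    [0.0, 0.0, 1.0],
])
log_a = np.log(trans + 1e-300)

# Assumed per-state Gaussian output densities over a 2-D feature vector.
means = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.5]])
var = 0.5                                    # shared isotropic variance

def log_b(state, x):
    """log b_i(x): isotropic Gaussian density of feature vector x in state i."""
    d = x - means[state]
    return -0.5 * (len(x) * math.log(2 * math.pi * var) + d @ d / var)

def viterbi(features):
    """Most likely state path for a sequence of feature vectors (log domain)."""
    T = len(features)
    delta = np.full((T, N_STATES), -np.inf)  # best log score ending in each state
    psi = np.zeros((T, N_STATES), dtype=int) # backpointers
    delta[0, 0] = log_b(0, features[0])      # the model is entered at its first state
    for t in range(1, T):
        for j in range(N_STATES):
            scores = delta[t - 1] + log_a[:, j]
            psi[t, j] = int(np.argmax(scores))
            delta[t, j] = scores[psi[t, j]] + log_b(j, features[t])
    path = [N_STATES - 1]                    # a left-to-right model ends in its last state
    for t in range(T - 1, 0, -1):
        path.append(psi[t, path[-1]])
    return path[::-1], float(delta[T - 1, N_STATES - 1])

feats = np.array([[0.1, -0.1], [0.2, 0.1], [0.9, 1.1], [1.1, 0.9], [2.1, 0.4]])
best_path, log_like = viterbi(feats)
print(best_path, log_like)                   # e.g. [0, 0, 1, 1, 2] and its log likelihood
```

A full recognizer chains many such phoneme models together and lets the same search pick both the phoneme sequence and the alignment, as described next.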
In this last figure, the alignment between states and feature vectors shows that, during generation, the model stayed in state 1 for two frames, in state 2 for five frames, and in state 3 for three frames. The same model can be used in recognition mode. In this mode, each model serves to compute the likelihood that its states generated the observed sequence of feature vectors, given a certain alignment between vectors and states. That is to say, start with state 1 and a given input feature vector x, and compute the probability of x from the probability density b1(x). If a shift is then made from state 1 back to itself, that probability is multiplied by the transition probability from state 1 to itself (0.2, in this case). Now take the next feature vector, compute its output probability according to b1(x), and multiply the result by the previous product. The next shift is from state 1 to state 2, so the accumulated product is multiplied by that transition probability (0.5). This alternation of output probabilities and transition probabilities continues until the sequence of feature vectors is exhausted. The final product is the likelihood that the given sequence of HMM states for the different phonemes gave rise to the sequence of input feature vectors. (A sketch of this scoring computation is given below, after the discussion of HMM limitations.) Changing the alignment between the feature vectors and the various phoneme models changes the accumulated probability. The recognition problem is to find the sequence of phoneme models and the corresponding alignment that yield the highest probability. That phoneme sequence is given as the output of the recognizer, and the corresponding alignment gives the input's segmentation into phonemes. In this way, recognition and segmentation occur simultaneously.

HMM Limitations

HMMs have proved to be a good model of speech variability in time and feature space simultaneously. Their automatic training from speech data has enhanced speech recognition performance, and their probabilistic formulation provides a unified framework for scoring hypotheses and for combining knowledge sources. Current first-order HMM theory has two main limitations. First, there are the Markovian assumptions of conditional independence (being in a state depends only on the previous state). Second, HMMs are well defined only for processes that are a function of a single independent variable, such as time or one-dimensional position.

The first limitation is not a fundamental one. In principle, it is possible to define higher-order HMMs in which the dependence extends to earlier states and outputs; such extensions, however, complicate the models and can quickly make the computation intractable. As computer power grows by two or more orders of magnitude, it may become feasible to attempt more complicated models; in the meantime, much work remains to be done on improving and using the existing models. The second limitation, however, is fundamental: it seems impossible in principle to define HMMs rigorously for more than a single independent variable. Therefore, for problems such as optical character recognition, where the signal is naturally a function of two variables, researchers have had to devise methods to convert the two-dimensional problem into a one-dimensional one before the model can be used to find a solution.
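The scoring walk-through above (output probability, then transition probability, and so on) can be written down directly. The sketch below is again illustrative: the alignment is the one from the figure (two frames in state 1, five in state 2, three in state 3), the transition values 0.2 and 0.5 are those quoted in the text, and the remaining numbers, including the one-dimensional Gaussian standing in for b_i(x), are assumptions. The computation is done in the log domain to avoid numerical underflow.

```python
import math

# Assumed transition probabilities for a 3-state left-to-right model
# (0-indexed); 0.2 and 0.5 are the values quoted in the text for state 1.
a = {0: {0: 0.2, 1: 0.5, 2: 0.3}, 1: {1: 0.6, 2: 0.4}, 2: {2: 1.0}}

def log_b(state, x):
    """Stand-in output density: a 1-D unit-variance Gaussian per state.
    Real systems use mixtures of high-dimensional Gaussians over spectral
    feature vectors."""
    mean = [0.0, 1.0, 2.0][state]
    return -0.5 * (math.log(2 * math.pi) + (x - mean) ** 2)

def alignment_log_likelihood(features, states):
    """Score one fixed alignment: emit from the first state, then alternate
    transition probability and output probability until the frames run out."""
    score = log_b(states[0], features[0])
    for t in range(1, len(features)):
        score += math.log(a[states[t - 1]][states[t]])  # transition term
        score += log_b(states[t], features[t])          # output term
    return score

# The alignment from the figure: 2 frames in state 1, 5 in state 2, 3 in state 3.
states = [0, 0, 1, 1, 1, 1, 1, 2, 2, 2]
feats = [0.1, -0.2, 0.9, 1.1, 1.0, 0.8, 1.2, 2.1, 1.9, 2.0]
print(alignment_log_likelihood(feats, states))
```

Changing the alignment (or the phoneme models) changes this score; recognition amounts to searching for the combination with the highest score, which is what the Viterbi procedure sketched earlier does efficiently.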
Even with this last constraint (the restriction to a single independent variable), exploiting the excellent properties of HMMs has advanced the state of the art in a number of applications. Although speech recognition remains the dominant field in which HMMs are applied, their use has been spreading steadily to other fields. Among the other areas of application are language understanding, topic classification, sonar target classification, optical character and handwriting recognition, recognition of human gestures, lip-reading from color motion, texture classification, electroencephalographic (EEG) pattern recognition, skill learning in robotics, modeling of burst errors in digital channels, and blind equalization of communication channels. Undoubtedly more potential uses for the HMM will be found as an understanding of its workings spreads.