that is the basis of modern speech recognition
and handwriting recognition
systems. Hidden Markov models (HMMs) were developed by L. Rabiner at AT & T, and superseded dynamic time warping
. Dynamic time warping worked well for isolated word recognition
, but did not generalise well to continuous speech recognition
. An HMM consists of a few bits and pieces (note the similarity to Finite State Machine
- Some symbols that it emits. For now think of a symbol as a letter.
- Some states. An HMM might have 3 states, but as few as one and as many as a few hundred. At every time step a state emits a symbol.
- Some probabilities. Firstly the probability that in a given state, a particular symbol will be emitted, secondly the probability that the HMM will transition (i.e. jump) from the any state to any other state at the next time step. Many of these will be zero.
- A starting state and a final state.
So the HMM starts in a start state, emits a letter, jumps
to another state (or possibly stays in the same state), emits a letter and so on until it hits a final state.
What this allows you to do is now to ask a question like:
Given a particular HMM, and if I start in state 1, emit "a", go to state 2, emit "b", go to state 3, emit an "a", go back to state 1 and emit a "c", what is the probability of all of that happening for this HMM?
This question's fairly easy to answer ... you just multiply all the individual probabilities of the emissions and transitions together. But what if I ask the more difficult question without knowing the states it jumped through were, such as: What is the probability that this HMM generated the sequence a, b, a, c?
This is trickier. Several other sequences of transitions (e.g. rather than 1,2,3,1 you might have had 1,3,2,1 or even 1,1,1,1) could have produced that sequence. Not knowing what the states the system went through is why they are called "hidden" Markov models ... the state at any time is unknown. If you knew the state they would just be Markov models.
You can use the Viterbi algorithm to work out an approximation to the probability that a particular sequence of observations was generated by an HMM.
Fine, I hear you say ... what use is that for speech recognition? Well, imagine now that I have an HMM for every word. And rather than the symbols being letters, let them be the sounds that you made in a short chunk of time --usually about 50 milliseconds -- that we've processed in some way (usually using Fourier transforms and Cepstral coefficients) to simplify to phonemes. I record what you say, chop it into little chunks, then I work out the probability that the sounds were generated by this particular HMM. I do that for each HMM (i.e. for every word), then I then choose the word associated with the HMM that had the highest probability. Hey presto! I have a word that should correspond to the word that you spoke.
Why do HMMs work so well? Because of the way they're designed, it means that you can stretch words here and there or speed up the way that you say them and the basic idea will still work, you can say them slightly differently and it will still work, you can change the pitch of your voice and it will still work. Well, not always, but close enough to it to be useful. Basically, it's robust because it's a good stochastic model.
Ok, ok, but there are all these darned probabilities that need to be set? Where are they going to hire the drone that has to set the buggers? For every word there might be thousands of probabilities that need to be set.
Well, you don't. There is a thing called the Baum-Welch reestimation algorithm that can work out the probabilities from data. In other words, you buy thousands of hours of recordings of people talking (called a corpus) and you train your HMMs on them. This is an example of machine learning, and it's a nifty idea, because it means that the speech recognition software can start off with generic settings for everyone, but as time goes on, it tunes itself to understand your voice better.