Speech recognition, synthesis, analysis

Language modeling

Language modeling is essential for reducing the complexity of the natural-language speech recognition task. Speech recognition deals with natural language, and in any particular context some words are more likely to appear than others. Linguistic and AI approaches to NLP are becoming extremely sophisticated, but it is difficult to develop general models (grammars) of an entire language. Leading speech recognition systems have taken a seemingly less sophisticated approach. Noting that (for now) we are looking at recognition, not understanding, speech recognition researchers have concentrated on very simple models of language:

n-grams: An n-gram is simply a sequence of n words, $w_1, w_2, w_3, \ldots, w_n$. In n-gram modeling of language, the building blocks of our model are probabilities of the form

$p(w_n \mid w_{n-1}, \ldots, w_1)$: the probability of a word given the previous n-1 words. The complete model is then simply the collection of these probabilities, where in practice n has the value 2 (bi-grams) or 3 (tri-grams). This can lead to a lot of probabilities: in the trigram case with a 20,000-word vocabulary, there are potentially

$20{,}000^3 = 8 \times 10^{12}$

probabilities. Of course, most of these will be set to some floor value, but n-gram models often contain over 10 million separate probabilities.
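The arithmetic above can be checked with a few lines of Python. This is only an illustrative sketch: the function name `potential_ngram_count` is made up for this example, and it counts every possible word sequence, not the far smaller number actually observed in a corpus.

```python
def potential_ngram_count(vocab_size: int, n: int) -> int:
    """Number of distinct n-word sequences over a vocabulary of the given size."""
    return vocab_size ** n


vocab_size = 20_000  # vocabulary size used in the text
bigrams = potential_ngram_count(vocab_size, 2)
trigrams = potential_ngram_count(vocab_size, 3)

print(f"potential bigrams:  {bigrams:.1e}")   # 4.0e+08
print(f"potential trigrams: {trigrams:.1e}")  # 8.0e+12
```

Going from bigrams to trigrams multiplies the number of potential probabilities by the vocabulary size, which is why most entries in a real model end up at a floor value.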

n-gram language models are relatively easy to compute. Given a database of text, all that has to be done is to count the occurrences of the n-grams it contains and compute the probabilities from these counts. Do n-grams work? After all, it is quite obvious that a language like English contains dependencies that span much further than the previous two words! Well, they do work quite well in practice. Part of the reason why n-grams (and indeed Hidden Markov Models, or HMMs) work so well is that, although they are poor models of language, they can be trained from enormous quantities of data. It is much more difficult to train and estimate the parameters of more complex models from large corpora. And, at the end of the day, it appears that the single most important thing in speech recognition at the moment is to train the models on as much data as possible. The best speech recognizers have been trained on about 80 hours of speech data and 150 million words of text, and these amounts have since increased about fivefold.
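The counting procedure described above can be sketched for the bigram case as follows. This is a minimal illustration, not the method of any particular recognizer: the function name `train_bigram_model`, the `<s>` and `</s>` sentence-boundary markers, and the toy corpus are all assumptions made for the example. It uses plain relative frequencies and omits the floor values that real systems assign to unseen n-grams.

```python
from collections import Counter


def train_bigram_model(sentences):
    """Estimate p(w2 | w1) by relative frequency from tokenized sentences."""
    history_counts = Counter()  # how often w1 appears as a history
    bigram_counts = Counter()   # how often the pair (w1, w2) appears
    for words in sentences:
        # Hypothetical boundary markers so the first and last words have context.
        tokens = ["<s>"] + list(words) + ["</s>"]
        for w1, w2 in zip(tokens, tokens[1:]):
            history_counts[w1] += 1
            bigram_counts[(w1, w2)] += 1
    # Each probability is the bigram count divided by the count of its history.
    return {
        (w1, w2): count / history_counts[w1]
        for (w1, w2), count in bigram_counts.items()
    }


# Toy corpus, invented for illustration only.
corpus = [
    "the cat sat".split(),
    "the cat ran".split(),
    "the dog sat".split(),
]
model = train_bigram_model(corpus)

# "the" occurs three times and is followed by "cat" twice.
print(model[("the", "cat")])
```

Bigrams that never occur in the training text, such as ("cat", "dog") here, simply have no entry in this sketch. A practical model would give them a small non-zero probability, and that estimate only becomes reliable with the very large text collections the paragraph above refers to.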
