|Publication number||US7406416 B2|
|Application number||US 10/810,254|
|Publication date||Jul 29, 2008|
|Filing date||Mar 26, 2004|
|Priority date||Mar 26, 2004|
|Also published as||CN1673997A, CN100535890C, DE602005025955D1, EP1580667A2, EP1580667A3, EP1580667B1, US20050216265|
|Inventors||Ciprian Chelba, Milind Mahajan, Alejandro Acero|
|Original Assignee||Microsoft Corporation|
The present invention relates to language models. In particular, the present invention relates to storage formats for storing language models.
Language models provide probabilities for sequences of words. Such models are trained from a set of training data by counting the frequencies of sequences of words in the training data. One problem with training language models in this way is that sequences of words that are not observed in the training data will have zero probability in the language model, even though they may occur in the language.
To overcome this, back-off modeling techniques have been developed. Under a back-off technique, if a sequence of n words is not found in the training data, the probability for the sequence of words is estimated using a probability for a sequence of n−1 words and a back-off weight. For example, if a trigram ($w_{n-2}\,w_{n-1}\,w_n$) is not observed in the training data, its probability is estimated using the probability of the bigram ($w_{n-1}\,w_n$) and a back-off weight associated with the context ($w_{n-2}\,w_{n-1}$).
N-gram language models that use back-off techniques are typically stored in a standard format referred to as the ARPA standard format. Because of the popularity of back-off language models, the ARPA format has become a recognized standard for transmitting language models. However, not all language models have back-off weights. In particular, deleted interpolation N-gram models do not have back-off weights because they use a different technique for handling the data sparseness problem associated with language models. As a result, deleted interpolation language models have not been stored in the standard ARPA format. Because of this, it has not been easy to integrate deleted interpolation language models into language systems that expect to receive the language model in the ARPA format.
A method and apparatus are provided for storing parameters of a deleted interpolation language model as parameters of a backoff language model. In particular, the parameters of the deleted interpolation language model are stored in the standard ARPA format. Under one embodiment, the deleted interpolation language model parameters are formed using fractional counts.
The invention is operational with numerous other general purpose or special purpose computing system environments or configurations. Examples of well-known computing systems, environments, and/or configurations that may be suitable for use with the invention include, but are not limited to, personal computers, server computers, hand-held or laptop devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, telephony systems, distributed computing environments that include any of the above systems or devices, and the like.
The invention may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The invention is designed to be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules are located in both local and remote computer storage media including memory storage devices.
With reference to
Computer 110 typically includes a variety of computer readable media. Computer readable media can be any available media that can be accessed by computer 110 and includes both volatile and nonvolatile media, removable and non-removable media. By way of example, and not limitation, computer readable media may comprise computer storage media and communication media. Computer storage media includes both volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by computer 110. Communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. The term “modulated data signal” means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared and other wireless media. Combinations of any of the above should also be included within the scope of computer readable media.
The system memory 130 includes computer storage media in the form of volatile and/or nonvolatile memory such as read only memory (ROM) 131 and random access memory (RAM) 132. A basic input/output system 133 (BIOS), containing the basic routines that help to transfer information between elements within computer 110, such as during start-up, is typically stored in ROM 131. RAM 132 typically contains data and/or program modules that are immediately accessible to and/or presently being operated on by processing unit 120. By way of example, and not limitation,
The computer 110 may also include other removable/non-removable volatile/nonvolatile computer storage media. By way of example only,
The drives and their associated computer storage media discussed above and illustrated in
A user may enter commands and information into the computer 110 through input devices such as a keyboard 162, a microphone 163, and a pointing device 161, such as a mouse, trackball or touch pad. Other input devices (not shown) may include a joystick, game pad, satellite dish, scanner, or the like. These and other input devices are often connected to the processing unit 120 through a user input interface 160 that is coupled to the system bus, but may be connected by other interface and bus structures, such as a parallel port, game port or a universal serial bus (USB). A monitor 191 or other type of display device is also connected to the system bus 121 via an interface, such as a video interface 190. In addition to the monitor, computers may also include other peripheral output devices such as speakers 197 and printer 196, which may be connected through an output peripheral interface 195.
The computer 110 is operated in a networked environment using logical connections to one or more remote computers, such as a remote computer 180. The remote computer 180 may be a personal computer, a hand-held device, a server, a router, a network PC, a peer device or other common network node, and typically includes many or all of the elements described above relative to the computer 110. The logical connections depicted in
When used in a LAN networking environment, the computer 110 is connected to the LAN 171 through a network interface or adapter 170. When used in a WAN networking environment, the computer 110 typically includes a modem 172 or other means for establishing communications over the WAN 173, such as the Internet. The modem 172, which may be internal or external, may be connected to the system bus 121 via the user input interface 160, or other appropriate mechanism. In a networked environment, program modules depicted relative to the computer 110, or portions thereof, may be stored in the remote memory storage device. By way of example, and not limitation,
The present invention provides a technique for storing a language model generated through deleted interpolation in the standard ARPA format. In deleted interpolation, an N-gram probability is determined as a linear interpolation of a relative frequency estimate for the N-gram probability and a probability for a lower order n-gram. The probability of the lower order n-gram is similarly defined as an interpolation between the relative frequency probability estimate for the lower order n-gram and next lower order n-gram. This continues until a unigram probability is determined. Thus, the interpolation is determined recursively according to:
$$P(v_k \mid v_{k-(n-1)} \ldots v_{k-1}) = \bigl(1-\lambda_{n-1}(v_{k-(n-1)} \ldots v_{k-1})\bigr)\, f(v_k \mid v_{k-(n-1)} \ldots v_{k-1}) + \lambda_{n-1}(v_{k-(n-1)} \ldots v_{k-1})\, P(v_k \mid v_{k-(n-2)} \ldots v_{k-1}) \qquad \text{EQ. 1}$$
where $P(v_k \mid v_{k-(n-1)} \ldots v_{k-1})$ is the probability of the n-gram, $\lambda_{n-1}(v_{k-(n-1)} \ldots v_{k-1})$ is an interpolation weight that is a function of the context $v_{k-(n-1)} \ldots v_{k-1}$ of the N-gram, $f(v_k \mid v_{k-(n-1)} \ldots v_{k-1})$ is the relative frequency of the N-gram, which is the number of times the N-gram appears in the training text over the number of times the context of the N-gram appears in the training text, and $P(v_k \mid v_{k-(n-2)} \ldots v_{k-1})$ is the probability of the next lower order n-gram, which is determined recursively using Equation 1 with a weight $\lambda_{n-2}(v_{k-(n-2)} \ldots v_{k-1})$ that is a function of the context of the lower order n-gram. The recursion of Equation 1 ends with a unigram probability that is determined as:
$$P(v_k) = (1-\lambda_0)\, f(v_k) + \lambda_0\, \frac{1}{|V|} \qquad \text{EQ. 2}$$
where $P(v_k)$ is the unigram probability, $\lambda_0$ is the unigram interpolation weight, $f(v_k)$ is the relative frequency for the unigram $v_k$, which is the ratio of the number of times the unigram appears in the training text over the number of words in the training text, and $|V|$ is the number of words in the vocabulary, so that $1/|V|$ acts as a default unigram probability.
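Using the recursion of Equations 1 and 2, the probability for the N-gram becomes an interpolation of relative frequencies for different orders of n-grams that are below the N-gram of interest. For example, for a trigram, the recursive interpolation produces:
$$P(v_k \mid v_{k-2} v_{k-1}) = (1-\lambda_2)\, f(v_k \mid v_{k-2} v_{k-1}) + \lambda_2 (1-\lambda_1)\, f(v_k \mid v_{k-1}) + \lambda_2 \lambda_1 (1-\lambda_0)\, f(v_k) + \lambda_2 \lambda_1 \lambda_0\, \frac{1}{|V|}$$
where $P(v_k \mid v_{k-2} v_{k-1})$ is the trigram probability, $f(v_k \mid v_{k-2} v_{k-1})$ is the relative frequency of the trigram in a training text, $f(v_k \mid v_{k-1})$ is the relative frequency of a bigram in the training text, $f(v_k)$ is the relative frequency of the unigram in the training text, $|V|$ is the number of vocabulary words in the language model, and $\lambda_2$, $\lambda_1$, $\lambda_0$ are the context-dependent interpolation weights, with $\lambda_2$ standing for $\lambda_2(v_{k-2} v_{k-1})$ and $\lambda_1$ for $\lambda_1(v_{k-1})$.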
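The recursion of Equations 1 and 2 can be illustrated with a minimal Python sketch. The function names (`count_ngrams`, `rel_freq`, `interp_prob`), the toy training text, and the single shared weight value of 0.4 are assumptions made for illustration only and are not taken from the described embodiments. The sketch also assumes that a context never seen in the training text receives a weight of 1, so that all of its probability mass comes from the lower order.

```python
from collections import Counter

def count_ngrams(tokens, order):
    """Count n-grams of length 1..order, plus how often each context is
    followed by a word (the denominator of the relative frequency)."""
    ngrams, contexts = Counter(), Counter()
    for n in range(1, order + 1):
        for i in range(len(tokens) - n + 1):
            gram = tuple(tokens[i:i + n])
            ngrams[gram] += 1
            contexts[gram[:-1]] += 1
    return ngrams, contexts

def rel_freq(word, context, ngrams, contexts):
    """f(word | context); zero when the context was never observed."""
    denom = contexts[context]
    return ngrams[context + (word,)] / denom if denom else 0.0

def interp_prob(word, context, ngrams, contexts, vocab_size, lam):
    """Equation 1, ending in Equation 2 when the context is empty.
    lam(context) returns the interpolation weight for that context."""
    f = rel_freq(word, context, ngrams, contexts)
    weight = lam(context)
    if not context:
        lower = 1.0 / vocab_size          # default unigram probability
    else:
        lower = interp_prob(word, context[1:], ngrams, contexts,
                            vocab_size, lam)
    return (1.0 - weight) * f + weight * lower

# Toy data (hypothetical): one weight shared by every seen context.
tokens = "a b a c a b".split()
vocab = sorted(set(tokens))
ngrams, contexts = count_ngrams(tokens, 3)
lam = lambda context: 0.4 if contexts[context] else 1.0
p = interp_prob("a", ("a", "b"), ngrams, contexts, len(vocab), lam)
```

Because each level mixes a normalized relative frequency with a normalized lower-order distribution, the probabilities for a fixed context sum to one over the vocabulary.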
Under some embodiments, the counts used to determine the relative frequencies are not limited to integer valued counts and may include fractional values that are computed as the expected values of the counts. This is one advantage of deleted interpolation over other back-off methods, such as the Katz back-off method, which cannot be used on fractional (real valued) counts.
For example, beginning at node 200, the interpolated unigram probability is determined as the weighted sum of the unigram relative frequency 202 and the default unigram probability 204 where weight 206 ($1-\lambda_0$) is applied to relative frequency 202 and weight 208 ($\lambda_0$) is applied to the default unigram probability 204.
The probability at the next higher node 210 is the weighted sum of the relative frequency 212 for the bigram and the unigram probability of node 204. A weight 214 ($\lambda_1(v_{k-1})$), which is a function of the context of the bigram, is applied to the unigram probability of node 204 while a weight 216 ($1-\lambda_1(v_{k-1})$) is applied to the relative frequency 212.
This recursive summation continues upward until it reaches node 220 for the N-gram probability. The probability determined for node 220 is the weighted sum of the probability determined at node 222 for the next lower order n-gram and the relative frequency 224 of the N-gram, where the weight 226 applied to the lower order probability is $\lambda_{n-1}(v_{k-(n-1)} \ldots v_{k-1})$ and the weight 228 applied to the relative frequency is $1-\lambda_{n-1}(v_{k-(n-1)} \ldots v_{k-1})$, which are both dependent on the context of the N-gram.
As can be seen from
In step 400 of
At step 404, the relative frequencies 308 are applied to an EM trainer 310, which uses an expectation maximization algorithm to set the values for the weights, $\lambda_{n-1} \ldots \lambda_0$, so as to maximize the total probability of all of the highest order N-grams such that:
$$[\lambda_{n-1} \ldots \lambda_0] = \arg\max_{\lambda_{n-1} \ldots \lambda_0} \prod_i P_i\bigl(v_k \mid v_{k-(n-1)} \ldots v_{k-1};\ \lambda_{n-1} \ldots \lambda_0\bigr)$$
where $[\lambda_{n-1} \ldots \lambda_0]$ is the set of weights that maximizes the total probability of the highest order N-grams, the total probability is the product of the individual probabilities $P_i$ for each ith N-gram, and each individual probability is calculated using the recursive interpolation of Equations 1 and 2.
As noted above, the weights are functions of the contexts of the n-gram probabilities they are used to determine. To counteract data sparseness (which would lead to unreliable estimates) and at the same time reduce the computational complexity of the EM training, these weights are grouped into buckets based on the frequency counts of the context. Under one embodiment, contexts whose frequency counts fall within the same range share the same weight. Thus, one $\lambda_{n-1}$ may be for contexts that are seen between 16 and 32 times and one $\lambda_{n-1}$ may be for contexts that are seen between 33 and 64 times. This results in a smaller set of weights that need to be trained and a smaller set of training text that is needed for training.
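The bucketing of weights by context count can be sketched as follows. Only the two ranges 16 to 32 and 33 to 64 come from the description above; the remaining bucket edges, the name `bucket_index`, and the example weight table are hypothetical choices made for illustration.

```python
from bisect import bisect_left

# Inclusive upper edges of the count ranges. The edges 32 and 64 reproduce
# the two example ranges in the text; the other edges are assumptions.
UPPER_EDGES = [0, 1, 3, 7, 15, 32, 64, 128]

def bucket_index(context_count):
    """Map a context's frequency count to the index of its weight bucket.
    Counts above the last edge fall into one extra overflow bucket."""
    return bisect_left(UPPER_EDGES, context_count)

# One weight per bucket instead of one weight per context (values made up).
weights = [1.0, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.25, 0.2]

def weight_for(context_count):
    """Look up the shared interpolation weight for a context count."""
    return weights[bucket_index(context_count)]
```

With this layout, a context seen 20 times and a context seen 30 times share a single trainable weight, which is what keeps the number of parameters handed to the EM trainer small.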
Note that since the weights are maximized against check data 304, there will be n-grams in check data 304 that were not observed in main data 302. Thus, the weights are set to anticipate unseen data.
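A minimal sketch of the expectation maximization update for a single interpolation weight is given below. It assumes the simplest case of one shared weight at one order, with each check-data n-gram reduced to a pair consisting of its relative frequency and its lower-order probability; the names `em_weight` and `log_likelihood` and the numeric pairs are hypothetical. The described embodiments train a set of bucketed weights across all orders, which this sketch does not attempt.

```python
import math

def log_likelihood(pairs, lam):
    """Log of the product of interpolated probabilities over the check data."""
    return sum(math.log((1.0 - lam) * f + lam * p_lower)
               for f, p_lower in pairs)

def em_weight(pairs, lam=0.5, iterations=20):
    """Estimate one weight lam for P = (1 - lam) * f + lam * p_lower.

    pairs holds (f, p_lower) for each n-gram in the check data, where
    p_lower must be positive. Each iteration computes the posterior
    probability that the lower-order term generated each n-gram (E-step)
    and sets lam to the average posterior (M-step).
    """
    for _ in range(iterations):
        posterior_sum = 0.0
        for f, p_lower in pairs:
            mixture = (1.0 - lam) * f + lam * p_lower
            posterior_sum += lam * p_lower / mixture
        lam = posterior_sum / len(pairs)
    return lam

# Hypothetical check data: two n-grams were unseen in the main data (f = 0).
check_pairs = [(0.5, 0.1), (0.0, 0.2), (0.3, 0.3), (0.0, 0.05)]
trained = em_weight(check_pairs)
```

N-grams whose relative frequency is zero can only be explained by the lower-order term, so they pull the weight upward; this is how training against check data prepares the weights for unseen text.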
Under some embodiments, the training text 300 may be re-segmented in a different manner and the relative frequency counts may be re-determined for the new grouping of text. These new frequency counts may then be applied to the EM trainer 310 to re-determine the values of the weights. When re-determining the values of the weights, the algorithm begins with the estimates of the weights determined at the previous iteration. Such iterations may be repeated until the weights reach stable values. After the desired number of iterations has been performed, a set of weights 312 is stored together with the final set of relative frequency counts 308 as a deleted interpolation model 314 at step 406. This deleted interpolation model may be used to determine probabilities for new text by parsing the text into the different order n-grams, searching for the appropriate weights for each of the contexts and performing the calculation of the interpolated probability using Equations 1 and 2.
The interpolation represented by Equations 1 and 2 is substantially different from the techniques used with the more widely accepted backoff language models, which are typically represented in the standard ARPA format. Instead of using a linear interpolation to determine the probability for an N-gram, the more widely accepted backoff language models use a substitute probability for any N-gram that cannot be located in the model. This substitute probability is based on a lower order model and a backoff weight associated with the context of the probability that cannot be located. Thus, instead of performing an interpolation, the more standard backoff language models simply replace an N-gram probability with a lower order n-gram probability, scaled by the backoff weight.
Once a probability is found for an n-gram at step 506, the probability is multiplied by all of the backoff weights that were encountered in iterations through steps 504 and 506 to form the probability for the N-gram at step 508.
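The backoff lookup described above can be sketched in a few lines of Python. The dictionary layout, the function name `backoff_prob`, and the toy values are assumptions for illustration. The sketch also assumes that a context with no stored backoff weight is treated as having a weight of 1, which is a common convention rather than something stated in the text.

```python
def backoff_prob(ngram, probs, backoffs):
    """Return the backoff probability of the last word of `ngram`.

    probs maps stored n-gram tuples to probabilities; backoffs maps
    context tuples to backoff weights. Each time the n-gram is missing,
    the weight of its context is accumulated and the oldest word is
    dropped to move to the next lower order.
    """
    weight = 1.0
    while ngram not in probs:
        if len(ngram) == 1:
            raise KeyError(f"word {ngram[0]!r} is not in the model")
        weight *= backoffs.get(ngram[:-1], 1.0)   # assumed default of 1
        ngram = ngram[1:]
    return weight * probs[ngram]

# Toy model (hypothetical values).
probs = {("a",): 0.5, ("b",): 0.5, ("a", "b"): 0.8}
backoffs = {("a",): 0.4}

stored = backoff_prob(("a", "b"), probs, backoffs)         # found directly
backed_off = backoff_prob(("a", "a"), probs, backoffs)     # 0.4 * P(a)
two_steps = backoff_prob(("b", "a", "a"), probs, backoffs) # two back-offs
```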
As can be seen in
Below list 604 is a set of sections, with one section for each order of n-gram. Each section is headed with a separate tag such as tag 610 for unigrams, tag 612 for bigrams, and tag 614 for N-grams, where N is the top order of n-grams in the language model.
Below each heading for the different orders of n-grams, there is a list of entries, one for each n-gram of that order. Each entry includes the probability of the n-gram, the n-gram, and for n-grams of orders other than the top order, a backoff weight. For example, under unigram heading 610, entry 618 includes a probability 622 for a unigram 620 and a backoff weight 616. Note that backoff weight 616 is associated with word 620 when word 620 is used as a context in a bigram. Similarly, entry 624 under bigram heading 612 includes a bigram probability 626 for the bigram 628 consisting of words v1v2 and a backoff weight 630 associated with words v1v2 being used as the context of a trigram. Typically, the probabilities and the backoff weights are stored in log base 10 format.
For entries under top order n-gram heading 614, there are no backoff weights. Thus, for entry 632 there is just a probability 634, and an n-gram v1 . . . vn 636.
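The layout described above can be illustrated with a short Python sketch that serializes a small model as ARPA-style text: a `\data\` header listing the number of entries per order, one `\N-grams:` section per order, and a closing `\end\` line. The function name `format_arpa`, the tab separators, the four-decimal formatting, and the toy values are choices made for this illustration; the sketch assumes all probabilities and weights are positive so that the base 10 logarithm is defined.

```python
import math

def format_arpa(probs, backoffs):
    """Render n-gram probabilities and backoff weights as ARPA-style text.

    probs maps n-gram tuples to probabilities; backoffs maps n-gram tuples
    (used as contexts) to backoff weights. Values are written as log10.
    """
    order = max(len(gram) for gram in probs)
    lines = ["\\data\\"]
    for n in range(1, order + 1):
        count = sum(1 for gram in probs if len(gram) == n)
        lines.append(f"ngram {n}={count}")
    for n in range(1, order + 1):
        lines += ["", f"\\{n}-grams:"]
        for gram in sorted(g for g in probs if len(g) == n):
            fields = [f"{math.log10(probs[gram]):.4f}", " ".join(gram)]
            # Only orders below the top order carry a backoff weight.
            if n < order and gram in backoffs:
                fields.append(f"{math.log10(backoffs[gram]):.4f}")
            lines.append("\t".join(fields))
    lines += ["", "\\end\\"]
    return "\n".join(lines) + "\n"

# Toy model (hypothetical values).
arpa_text = format_arpa(
    {("a",): 0.5, ("b",): 0.5, ("a", "b"): 0.8},
    {("a",): 0.4},
)
```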
In step 700 of
If the relative frequency of the L-gram is not greater than zero, the probability for the L-gram is not stored in the standard ARPA format.
After the probability for the L-gram has been stored at step 708 or after it has been determined that the relative frequency of the L-gram is not greater than zero, the method determines if there are more L-grams to consider for the top order of L-grams at step 710. If there are more L-grams to consider, the process returns to step 704 and selects the next L-gram. Steps 706 and 708 are then repeated for this new L-gram. Steps 704, 706, 708, and 710 are repeated until all the L-grams of the top order have been processed.
Once all of the L-grams for the top order of L-grams have been processed at step 710, the method determines if the current order of L-grams being processed is greater than zero at step 712. If the order of L-grams currently being processed is greater than zero, the order is reduced by one to move to the next lower order at step 714. An L-gram at this next lower order is then selected at step 716.
At step 718, the relative frequency of the selected L-gram is examined to determine if it is greater than zero. If it is not greater than zero, the process continues at step 720 where the higher order L-grams previously stored in the ARPA file are examined to determine if the present L-gram is a context of one of the higher order L-grams. If the L-gram is found as a context in a higher order L-gram at step 720 or the relative frequency of the L-gram is greater than zero at step 718, the interpolated probability of the L-gram is stored as the probability of the L-gram in the ARPA file and the λ that is a function of the L-gram in the deleted interpolation model is stored as the backoff weight for the L-gram at step 722. For example, if the λ's are functions of the relative frequencies of the L-grams, the λ associated with the relative frequency of the current L-gram is stored as the backoff weight. For example, if the L-gram is the bigram v1v2, the weight associated with bigrams that have a relative frequency equal to the relative frequency of v1v2 is used as the backoff weight for the bigram v1v2 and the interpolated probability is used as the probability of the bigram v1v2.
Thus, an L-gram is stored if its relative frequency is greater than zero, i.e., it was seen in the training data, or if it appears as a context for a higher order L-gram. By limiting the L-grams that are stored to those that meet these criteria, this embodiment of the present invention creates a compact language model in the backoff format.
An L-gram can appear as a context while having a relative frequency of zero in the training text if the relative frequencies are determined by setting the relative frequencies to zero if their initial relative frequency is below a threshold. For example, if an L-gram has a relative frequency of 0.02 and the threshold is set to 0.02, the relative frequency for the L-gram would be set to zero. This is done to reduce the size of the interpolation model.
The reason for storing an L-gram if it appears as a context in a higher order L-gram even though it has a relative frequency of zero is that since the L-gram appears as a context for a higher order L-gram, a backoff weight for this context will be needed in the language model.
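The storage rule described above can be sketched end to end in Python. Everything named here (`make_model`, `to_backoff`, `backoff_lookup`, the toy text, the single shared weight of 0.4) is hypothetical and chosen for illustration; the sketch assumes every vocabulary word occurs in the training text and that unseen contexts receive a weight of 1. It stores the interpolated probability as the n-gram probability and the interpolation weight of an n-gram as its backoff weight, then relies on an ordinary backoff lookup to recover the interpolated probabilities.

```python
from collections import Counter

def make_model(tokens, order, weight=0.4):
    """Build a toy deleted interpolation model with one shared weight."""
    ngrams, contexts = Counter(), Counter()
    for n in range(1, order + 1):
        for i in range(len(tokens) - n + 1):
            gram = tuple(tokens[i:i + n])
            ngrams[gram] += 1
            contexts[gram[:-1]] += 1
    vocab = sorted(set(tokens))

    def rel_freq(gram):
        denom = contexts[gram[:-1]]
        return ngrams[gram] / denom if denom else 0.0

    def lam(context):
        # Assumption: an unseen context defers entirely to the lower order.
        return weight if contexts[context] else 1.0

    def interp(gram):
        w = lam(gram[:-1])
        lower = interp(gram[1:]) if len(gram) > 1 else 1.0 / len(vocab)
        return (1.0 - w) * rel_freq(gram) + w * lower

    return ngrams, vocab, rel_freq, lam, interp

def to_backoff(ngrams, rel_freq, lam, interp, order):
    """Store L-grams from the top order down, keeping those that were seen
    or that serve as the context of an already stored higher order L-gram."""
    probs, backoffs, needed_contexts = {}, {}, set()
    for n in range(order, 0, -1):
        candidates = {g for g in ngrams if len(g) == n}
        candidates |= {c for c in needed_contexts if len(c) == n}
        for gram in candidates:
            if rel_freq(gram) > 0 or gram in needed_contexts:
                probs[gram] = interp(gram)       # interpolated probability
                if n < order:
                    backoffs[gram] = lam(gram)   # weight as backoff weight
                if n > 1:
                    needed_contexts.add(gram[:-1])
    return probs, backoffs

def backoff_lookup(gram, probs, backoffs):
    """Standard backoff lookup; a missing backoff weight counts as 1."""
    if gram in probs:
        return probs[gram]
    if len(gram) == 1:
        raise KeyError(gram)
    return backoffs.get(gram[:-1], 1.0) * backoff_lookup(gram[1:], probs, backoffs)

ngrams, vocab, rel_freq, lam, interp = make_model("a b a c a b".split(), 3)
probs, backoffs = to_backoff(ngrams, rel_freq, lam, interp, 3)
```

The lookup agrees with the interpolation because an unseen n-gram has a relative frequency of zero, so Equation 1 reduces to the weight times the lower order probability, which is exactly the backoff computation.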
After step 722 or if the current selected L-gram does not have a relative frequency greater than zero at step 718 and is not used as a context of a higher order L-gram at step 720, the process determines if there are more L-grams of the current order at step 724. If there are more L-grams at the current order, the next L-gram is selected at step 716 and steps 718, 720, 722, and 724 are repeated. Steps 716, 718, 720, 722 and 724 are repeated until all of the L-grams of the selected order have been processed.
When there are no more L-grams for the current order at step 724, the process returns to step 712 to determine if the order is greater than zero. If the order is greater than zero, the next lower order is selected at step 714 and steps 716-724 are repeated for the L-grams in the new lower order. When the order is no longer greater than zero at step 712, all of the orders of n-grams have been processed and the method of
Thus, the method of
Although the present invention has been described with reference to particular embodiments, workers skilled in the art will recognize that changes may be made in form and detail without departing from the spirit and scope of the invention.
|Cited Patent||Filing date||Publication date||Applicant||Title|
|US5258909 *||Aug 31, 1989||Nov 2, 1993||International Business Machines Corporation||Method and apparatus for "wrong word" spelling error detection and correction|
|US5267345 *||Feb 10, 1992||Nov 30, 1993||International Business Machines Corporation||Speech recognition apparatus which predicts word classes from context and words from word classes|
|US5444617 *||Dec 14, 1993||Aug 22, 1995||International Business Machines Corporation||Method and apparatus for adaptively generating field of application dependent language models for use in intelligent systems|
|US5467425 *||Feb 26, 1993||Nov 14, 1995||International Business Machines Corporation||Building scalable N-gram language models using maximum likelihood maximum entropy N-gram models|
|US6188976||Oct 23, 1998||Feb 13, 2001||International Business Machines Corporation||Apparatus and method for building domain-specific language models|
|EP0805434A2||Apr 29, 1997||Nov 5, 1997||Microsoft Corporation||Method and system for speech recognition using continuous density hidden Markov models|
|1||Bacchiani et al., "MAP Adaptation of Stochastic Grammars," Computer Speech and Language, Elsevier, London, Jan. 11, 2005.|
|2||C. Crespo, D. Tapias, G. Escalada, and J. Alvarez, 1997, "Language Model Adaptation for Conversational Speech Recognition Using Automatically Tagged Pseudo-Morphological Classes," In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 2, pp. 823-826.|
|3||Chelba, et al., "Speech Utterance Classification," 2003 IEEE International Conference on Acoustics, Speech and Signal Processing, Hong Kong, Apr. 6-10, 2003.|
|4||Chinese First Office Action from Application No. 200510060153.6, filed Mar. 22, 2005.|
|5||Cyril Allauzen, Mehryar Mohri, and Brian Roark, 2003, "Generalized Algorithms for Constructing Language Models," To appear in Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics (ACL).|
|6||Dempster, A.P., et al., "Maximum Likelihood from Incomplete Data Via the EM Algorithm," Journal of the Royal Statistical Society, vol. 39 of B, pp. 1-38, 1977.|
|7||European Search Report from Application No. 05102283.8, filed Mar. 22, 2005.|
|8||F. Jelinek, B. Merialdo, S. Roukos, and M. Strauss, 1991, "A Dynamic Language Model for Speech Recognition," In Proceedings of Speech and Natural Language DARPA Workshop, pp. 293-295.|
|9||Frederick Jelinek and Robert Mercer, 1980, "Interpolated Estimation of Markov Source Parameters from Sparse Data," In E. Gelsema and L. Kanal, editors, Pattern Recognition In Practice, pp. 387-397.|
|10||H. Erdogan, "Semantic Structural Language Models," ICSLP 2002 Proceedings.|
|11||Huang et al., "Deleted Interpolation And Density Sharing for Continuous Hidden Markov Models," 1996 IEEE International Conference On Acoustics, Speech and Signal Processing, Atlanta, May 7-10, 1996.|
|12||J. Zhang et al., "Improvements in Audio Processing and Language Modeling in the CU Communicator," Eurospeech 2001.|
|13||Jelinek, F., et al., "Interpolated Estimation of Markov Source Parameters from Sparse Data," E. Gelsema and L. Kanal, editors, Pattern Recognition in Practice, pp. 381-397, 1980.|
|14||Office Action from the Chinese Patent Office in foreign application No. 200510060153.6 Mar. 25, 2005.|
|15||Paul, D.B., et al, "The Design for the Wall Street Journal-Based CSR Corpus," Proceedings of the DARPA SLS Workshop, Feb. 1992.|
|16||Renato DeMori and Marcello Federico, 1999, "Language Model Adaptation," In K. Ponting, editor, Computational Models of Speech Pattern Processing, pp. 280-303, Springer Verlag, Berlin, New York.|
|17||S. Katz, "Estimation of Probabilities From Sparse Data for the Language Model Component of a Speech Recognizer." In IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 35, pp. 400-401, Mar. 1987.|
|18||S. Mori, M. Nishimura, and N. Itoh, 2003, "Language Model Adaptation Using Word Clustering," In Proceedings of the European Conference on Speech Communication and Technology, pp. 425-428.|
|19||Seymore, K., et al., "Scalable Back-Off Language Models," Proceedings ICSLP, vol. 1, pp. 232-235, Philadelphia, 1996.|
|20||Stolcke, A., "Entropy-Based Pruning of Backoff Language Models," Proceedings of News Transcription and Understanding Workshop, pp. 270-274, Lansdowne, VA 1998, DARPA.|
|21||Wu et al., "Improved Katz Smoothing for Language Modeling in Speech Recognition," ICSLP 2002: 7th International Conference on Spoken Language Processing, Denver, Colorado, Sep. 16-20, 2002.|
|Citing Patent||Filing date||Publication date||Applicant||Title|
|US8332207 *||Jun 22, 2007||Dec 11, 2012||Google Inc.||Large language models in machine translation|
|US8700404 *||Aug 27, 2005||Apr 15, 2014||At&T Intellectual Property Ii, L.P.||System and method for using semantic and syntactic graphs for utterance classification|
|US8798983 *||Mar 30, 2009||Aug 5, 2014||Microsoft Corporation||Adaptation for statistical language model|
|US8812291 *||Dec 10, 2012||Aug 19, 2014||Google Inc.||Large language models in machine translation|
|US20070078653 *||Oct 3, 2005||Apr 5, 2007||Nokia Corporation||Language model compression|
|US20080243481 *||Jun 22, 2007||Oct 2, 2008||Thorsten Brants||Large Language Models in Machine Translation|
|US20080282154 *||Sep 11, 2006||Nov 13, 2008||Nurmi Mikko A||Method and apparatus for improved text input|
|US20100250251 *||Mar 30, 2009||Sep 30, 2010||Microsoft Corporation||Adaptation for statistical language model|
|US20130346059 *||Dec 10, 2012||Dec 26, 2013||Google Inc.||Large language models in machine translation|
|US20150088511 *||Sep 24, 2013||Mar 26, 2015||Verizon Patent And Licensing Inc.||Named-entity based speech recognition|
|U.S. Classification||704/240, 704/9, 704/255, 704/243|
|International Classification||G06F17/27, G10L15/00, G06F17/28, G10L15/28, G10L15/18|
|Cooperative Classification||G06F17/277, G10L15/197|
|Mar 26, 2004||AS||Assignment|
Owner name: MICROSOFT CORPORATION, WASHINGTON
Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:CHELBA, CIPRIAN;MAHAJAN, MILIND;ACERO, ALEJANDRO;REEL/FRAME:015159/0573;SIGNING DATES FROM 20040324 TO 20040325
|Sep 21, 2011||FPAY||Fee payment|
Year of fee payment: 4
|Dec 9, 2014||AS||Assignment|
Owner name: MICROSOFT TECHNOLOGY LICENSING, LLC, WASHINGTON
Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:MICROSOFT CORPORATION;REEL/FRAME:034541/0477
Effective date: 20141014