(12) United States Patent
(10) Patent No.: US 7,165,028 B2
(45) Date of Patent: Jan. 16, 2007
(54) METHOD OF SPEECH RECOGNITION RESISTANT TO CONVOLUTIVE DISTORTION AND ADDITIVE DISTORTION
(75) Inventor: Yifan Gong, Plano, TX (US)
(73) Assignee: Texas Instruments Incorporated,
Dallas, TX (US)
( * ) Notice: Subject to any disclaimer, the term of this patent is extended or adjusted under 35 U.S.C. 154(b) by 848 days.
(21) Appl. No.: 10/251,734
(22) Filed: Sep. 20, 2002
(65) Prior Publication Data
US 2003/0115055 A1 Jun. 19, 2003
Related U.S. Application Data
(60) Provisional application No. 60/339,327, filed on Dec. 12, 2001.
(51) Int. Cl.
(52) U.S. Cl. 704/233; 704/256
(58) Field of Classification Search 704/233,
See application file for complete search history.
(56) References Cited
U.S. PATENT DOCUMENTS
4,630,304 A * 12/1986 Borth et al. 381/94.3
4,905,288 A * 2/1990 Gerson et al. 704/245
4,918,732 A * 4/1990 Gerson et al. 704/233
4,959,865 A * 9/1990 Stettiner et al. 704/233
5,263,019 A * 11/1993 Chu 370/288
5,924,065 A * 7/1999 Eberman et al. 704/231
5,970,446 A * 10/1999 Goldberg et al. 704/233
6,006,175 A * 12/1999 Holzrichter 704/208
6,202,047 B1 * 3/2001 Ephraim et al. 704/256.6
6,219,635 B1 * 4/2001 Coulter et al. 704/207
6,389,393 B1 * 5/2002 Gong 704/244
6,418,411 B1 * 7/2002 Gong 704/256.5
6,529,872 B1 * 3/2003 Cerisara et al. 704/250
6,658,385 B1 * 12/2003 Gong et al. 704/244
6,691,091 B1 * 2/2004 Cerisara et al. 704/255
6,868,378 B1 * 3/2005 Breton 704/233
6,876,966 B1 * 4/2005 Deng et al. 704/233
6,934,364 B1 * 8/2005 Ho 379/21
7,096,169 B1 * 8/2006 Crutchfield, Jr. 703/7
7,103,541 B1 * 9/2006 Attias et al. 704/233
* cited by examiner
Primary Examiner—Michael N. Opsasnick
(74) Attorney, Agent, or Firm—W. James Brady; Frederick
J. Telecky, Jr.
METHOD OF SPEECH RECOGNITION
RESISTANT TO CONVOLUTIVE
DISTORTION AND ADDITIVE DISTORTION
This application claims priority under 35 USC § 119(e)(1) of provisional application No. 60/339,327, filed Dec. 12, 2001.
FIELD OF INVENTION
This invention relates to speech recognition and more particularly to operation in an ambient noise environment (additive distortion) and channel changes (convolutive distortion) such as microphone changes.
BACKGROUND OF INVENTION
A speech recognizer trained with office-environment speech data and operating in a mobile environment may fail due to at least two distortion sources. The first is background noise, such as from a computer fan, car engine, or road noise. The second is microphone changes, such as from hand-held to hands-free, or a change in the microphone's position relative to the mouth. In mobile applications of speech recognition, both the microphone and the background noise are subject to change. Therefore, handling the two sources of distortion simultaneously is critical to performance.
The recognition failure can be reduced by retraining the recognizer's acoustic model using large amounts of training data collected under conditions as close as possible to the testing data. There are several problems associated with this approach.
Collecting a large database to train speaker-independent HMMs is very expensive.
It is not easy to determine if the collected data can cover all future noisy environments.
The recognizer has to spend a large number of parameters to cover the different environments collectively.
Averaging over a variety of data results in a flat distribution of models, which degrades the recognition of clean speech.
Cepstral Mean Normalization (CMN) removes the utterance mean and is a simple and efficient way of dealing with convolutive distortion such as telephone channel distortion. This is described by B. Atal in "Effectiveness of Linear Prediction Characteristics of the Speech Wave for Automatic Speaker Identification and Verification", Journal of the Acoustical Society of America, 55: 1304-1312, 1974. Spectral subtraction (SS) reduces background noise in the feature space, as described by S. F. Boll in "Suppression of Acoustic Noise in Speech Using Spectral Subtraction", IEEE Transactions on Acoustics, Speech and Signal Processing, ASSP-27(2): 113-120, April 1979. Parallel Model Combination (PMC) gives an approximation of speech models in noisy conditions from noise-free speech models and noise estimates, as described by M. J. F. Gales and S. Young in "An Improved Approach to the Hidden Markov Model Decomposition of Speech and Noise", in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, Volume I, pages 233-236, U.S.A., April 1992. The PMC technique is effective for speech recognition in noisy environments with a fixed microphone. These techniques do not require any training data. However, they deal only with either convolutive (channel, microphone) distortion or additive (background noise) distortion.
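As an illustration of the two techniques just described (not part of the patent itself), CMN and spectral subtraction can each be sketched in a few lines. The function names, the feature layout, and the power-spectral floor are illustrative assumptions:

```python
import numpy as np

def cepstral_mean_normalization(cepstra):
    """Remove the per-utterance mean from a (frames x coefficients)
    cepstral matrix. A stationary channel adds a constant offset in
    the cepstral domain, so subtracting the utterance mean cancels
    the convolutive distortion."""
    return cepstra - cepstra.mean(axis=0, keepdims=True)

def spectral_subtraction(noisy_power, noise_power, floor=0.01):
    """Subtract an estimated noise power spectrum from each frame of
    noisy power spectra, flooring the result at a small fraction of
    the noisy power to avoid negative values."""
    clean = noisy_power - noise_power
    return np.maximum(clean, floor * noisy_power)
```

Note that CMN needs no noise estimate at all, while spectral subtraction needs the noise power spectrum (typically estimated from non-speech frames); this mirrors the point above that each technique addresses only one of the two distortion types.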
Joint compensation of additive noise and convolutive noise can be achieved by the introduction of a channel model and a noise model. A spectral bias for additive noise and a cepstral bias for convolutive noise are described in an article entitled "A General Joint Additive and Convolutive Bias Compensation Approach Applied to Noisy Lombard Speech Recognition" by M. Afify, Y. Gong and J. P. Haton, in IEEE Transactions on Speech and Audio Processing, 6(6): 524-538, November 1998. The two biases can be calculated by application of EM (expectation maximization) in both the spectral and convolutive domains. The magnitude response of the distortion channel and the power spectrum of the additive noise can be estimated by an EM algorithm, using a mixture model of speech in the power spectral domain. This is described in an article entitled "Frequency-Domain Maximum Likelihood Estimation for Automatic Speech Recognition in Additive and Convolutive Noises" by Y. Zhao in IEEE Transactions on Speech and Audio Processing, 8(3): 255-266, May 2000. A procedure to calculate the convolutive component, which requires rescanning of the training data, is presented by J. L. Gauvain et al. in an article entitled "Developments in Continuous Speech Dictation Using the ARPA NAB News Task" published in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, pages 73-76, Detroit, 1996. Solution of the convolutive component by steepest-descent methods is reported in an article by Y. Minami and S. Furui entitled "A Maximum Likelihood Procedure for a Universal Adaptation Method Based on HMM Composition" published in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, pages 128-132, Detroit, 1995. The method described by Y. Minami and S. Furui entitled "Adaptation Method Based on HMM Composition and EM Algorithm" in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, pages 327-330, Atlanta, 1996, needs additional universal speech models, and re-estimation of the channel distortion with the universal models when the channel changes. A technique presented by M. J. F. Gales in Technical Report TR-154, CUED/F-INFENG, entitled "PMC for Speech Recognition in Additive and Convolutive Noise", December 1993, needs two passes over the test utterance, i.e. parameter estimation followed by recognition, several transformations between the cepstral and spectral domains, and a Gaussian mixture model for clean speech.
Alternatively, the nonlinear changes of both types of distortion can be approximated by linear equations, assuming that the changes are small. A Jacobian approach is described by S. Sagayama et al. in an article entitled "Jacobian Adaptation of Noisy Speech Models" in Proceedings of IEEE Automatic Speech Recognition Workshop, pages 396-403, Santa Barbara, Calif., USA, December 1997, IEEE Signal Processing Society. This Jacobian approach models speech model parameter changes as the product of a Jacobian matrix and the difference in noise conditions. The statistical linear approximation described by N. S. Kim is along the same direction; it is found in IEEE Signal Processing Letters, 5(1): 8-10, January 1998, in an article entitled "Statistical Linear Approximation for Environment Compensation".
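The first-order idea behind such Jacobian adaptation can be sketched as follows (an illustration, not the patent's method; variable names and the log-spectral setting are assumptions). If the noisy-speech mean in the log-spectral domain is y = log(exp(s) + exp(n)) for clean speech s and noise n, then the Jacobian of y with respect to n is exp(n) / (exp(s) + exp(n)) = exp(n - y), and a small noise change shifts the noisy mean by approximately that factor times the change:

```python
import numpy as np

def jacobian_adaptation(mu_noisy_ref, noise_ref, noise_new):
    """First-order update of a noisy-speech mean vector in the
    log-spectral domain. The Jacobian of y = log(exp(s) + exp(n))
    with respect to the noise n is exp(n - y), which is diagonal,
    so the update is element-wise:
        mu_new ~ mu_ref + exp(noise_ref - mu_ref) * (noise_new - noise_ref)
    All arguments are per-dimension log power spectra."""
    jacobian = np.exp(noise_ref - mu_noisy_ref)  # diagonal Jacobian terms
    return mu_noisy_ref + jacobian * (noise_new - noise_ref)
```

The appeal of this linearization, as the text notes, is that model parameters can be updated cheaply from a reference condition without recomputing the full nonlinear model combination; its accuracy degrades when the noise change is large.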
Finally, Maximum Likelihood Linear Regression (MLLR) transforms HMM parameters to match the distortion factors. See the article of C. J. Leggetter et al. entitled "Maximum Likelihood Linear Regression for Speaker Adaptation of Continuous Density HMMs", in Computer Speech and Language, 9(2): 171-185, 1995. MLLR does not model channel and background noise explicitly, but approximates their effect by piece-wise linearity. When given