The present invention is a system and method for characterizing human (or animate) speech voiced excitation functions and acoustic signals, for removing unwanted acoustic noise which often occurs when a speaker uses a microphone in common environments, and for synthesizing personalized or modified human (or other animate) speech upon command from a controller. A low power EM sensor is used to detect the motions of windpipe tissues in the glottal region of the human speech system before, during, and after voiced speech is produced by a user. From these tissue motion measurements, a voiced excitation function can be derived. Further, the excitation function provides speech production information to enhance noise removal from human speech and it enables accurate transfer functions of speech to be obtained. Previously stored excitation and transfer functions can be used for synthesizing personalized or modified human speech. Configurations of EM sensor and acoustic microphone systems...
Claims
1. A method for removing acoustic noise from speech, comprising the steps of:
- obtaining a speech excitation function using an EM sensor;
- identifying a first voiced-excitation onset time from the excitation function;
- obtaining an acoustic speech signal, corresponding to the speech excitation function;
- subtracting a first predetermined unvoiced time period from the first voiced-excitation onset time to obtain a corresponding first unvoiced-acoustic onset time within the acoustic speech signal;
- defining a no-speech time period prior to the first unvoiced-acoustic onset time;
- measuring acoustic noise within the no-speech time period; and
- reducing the acoustic noise in the acoustic speech signal.
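The timing steps of claim 1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the threshold-based onset detector, the function names (`first_voiced_onset`, `no_speech_window`, `rms`), and the sample values are all hypothetical, and the final noise-reduction step is omitted because the claim does not fix a particular method.

```python
import math

def first_voiced_onset(excitation, threshold):
    """Index of the first excitation sample whose magnitude exceeds threshold.

    A plain amplitude threshold is an assumption made for this sketch.
    """
    for i, x in enumerate(excitation):
        if abs(x) > threshold:
            return i
    return None

def no_speech_window(excitation, threshold, unvoiced_samples, init_index=0):
    """Return (start, end) sample indices of the no-speech period, end exclusive.

    The unvoiced-acoustic onset is the voiced onset minus a predetermined
    unvoiced period, here expressed in samples.
    """
    onset = first_voiced_onset(excitation, threshold)
    if onset is None:
        return None
    unvoiced_onset = max(init_index, onset - unvoiced_samples)
    return init_index, unvoiced_onset

def rms(samples):
    """Root-mean-square level, used here as a simple noise measure."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))

# Hypothetical data: voicing starts at sample 10 of the EM-derived excitation.
excitation = [0.0] * 10 + [1.0, -1.0] * 5
acoustic = [0.1, -0.1] * 5 + [0.5, -0.5] * 5

start, end = no_speech_window(excitation, threshold=0.5, unvoiced_samples=4)
noise_level = rms(acoustic[start:end])
```

Here the four samples just before voicing are treated as possibly unvoiced speech, so only samples 0 through 5 contribute to the noise measurement.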
2. The method of claim 1 wherein the step of defining a no-speech time period comprises the steps of:
- identifying an initialization time prior to a beginning of speech; and
- defining the no-speech time period between the initialization time and the first unvoiced-acoustic onset time.
3. The method of claim 1 wherein the step of defining a no-speech time period comprises the steps of:
- identifying a second voiced-excitation end-time from the excitation function;
- adding a second predetermined unvoiced time period to the second voiced-excitation end-time to obtain a corresponding second unvoiced-acoustic end-time within the acoustic speech signal; and
- defining the no-speech time period as a period between the second unvoiced-acoustic end-time and the first unvoiced-acoustic onset time.
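One way to picture the interval defined in claim 3 is to locate voiced segments in the excitation function and then shrink each gap between segments by the two predetermined unvoiced periods. The sketch below assumes the excitation has already been reduced to an envelope (so it does not dip below the threshold within a glottal cycle); the function names and the `lead` and `tail` parameters are hypothetical.

```python
def voiced_segments(envelope, threshold):
    """Return (start, end) index pairs, end exclusive, where envelope > threshold."""
    segments = []
    start = None
    for i, x in enumerate(envelope):
        voiced = abs(x) > threshold
        if voiced and start is None:
            start = i
        elif not voiced and start is not None:
            segments.append((start, i))
            start = None
    if start is not None:
        segments.append((start, len(envelope)))
    return segments

def no_speech_gaps(segments, lead, tail):
    """No-speech periods between consecutive voiced segments.

    tail: unvoiced samples assumed to follow a voiced end-time.
    lead: unvoiced samples assumed to precede a voiced onset time.
    """
    gaps = []
    for (_, prev_end), (next_start, _) in zip(segments, segments[1:]):
        gap_start = prev_end + tail
        gap_end = next_start - lead
        if gap_start < gap_end:
            gaps.append((gap_start, gap_end))
    return gaps

# Hypothetical envelope: two voiced bursts separated by 20 quiet samples.
envelope = [1.0] * 5 + [0.0] * 20 + [1.0] * 5
segments = voiced_segments(envelope, threshold=0.5)
gaps = no_speech_gaps(segments, lead=4, tail=3)
```

A gap is dropped when the two unvoiced allowances overlap, since no sample in it can then be safely treated as noise only.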
4. The method of claim 1:
- further comprising the step of, constructing an acoustic noise filter over a first set of time frames; and
- wherein the reducing step is comprised of the steps of,
- selecting an acoustic noise filter corresponding to the acoustic speech signal and the acoustic noise; and
- filtering the acoustic noise from the acoustic signal over a second set of time frames using the acoustic noise filter.
5. The method of claim 1 further comprising the steps of:
- defining time frames within the excitation function and acoustic speech signal based on glottal tissue configuration; and
- identifying a subset of the time frames where the excitation function is substantially constant; and
- wherein the reducing step is comprised of the step of averaging amplitudes of the acoustic speech signal over the subset.
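The averaging step of claim 5 can be illustrated with a small sketch. It assumes the signals have already been cut into equal-length frames (for example, one per glottal cycle) and treats "substantially constant" as "peak excitation within a relative tolerance of the first frame"; that criterion, the function names, and the data are assumptions for illustration only.

```python
def constant_excitation_frames(exc_frames, tolerance):
    """Indices of frames whose peak excitation is within a relative
    tolerance of the first frame's peak (an assumed constancy criterion)."""
    ref = max(abs(x) for x in exc_frames[0])
    keep = []
    for i, frame in enumerate(exc_frames):
        peak = max(abs(x) for x in frame)
        if abs(peak - ref) <= tolerance * ref:
            keep.append(i)
    return keep

def average_frames(frames, indices):
    """Sample-by-sample average of the selected equal-length frames."""
    n = len(indices)
    length = len(frames[indices[0]])
    return [sum(frames[i][k] for i in indices) / n for k in range(length)]

# Hypothetical frames: the third excitation frame differs and is excluded.
exc_frames = [[0.0, 1.0, 0.0], [0.0, 1.02, 0.0], [0.0, 2.0, 0.0]]
acoustic_frames = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0], [9.0, 9.0, 9.0]]

subset = constant_excitation_frames(exc_frames, tolerance=0.05)
averaged = average_frames(acoustic_frames, subset)
```

Because the excitation is unchanged across the selected frames, the speech component should repeat while uncorrelated noise tends to average down.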
6. The method of claim 1:
- wherein the obtaining an acoustic speech signal step includes the step of capturing the acoustic speech signal with an audio system; and
- further comprising steps of,
- identifying positive growing instabilities in the acoustic speech signal; and
- damping the instabilities by adjusting the audio system.
7. The method of claim 1:
- wherein the measuring acoustic noise step, includes the step of,
- detecting an echo signal within the acoustic speech signal corresponding to a voiced-speech portion of the acoustic speech signal;
- further including the step of,
- identifying a portion of the acoustic speech signal corresponding to the echo signal; and
- wherein the reducing step, includes the step of,
- sign, amplitude, and phase adjusting the portion of the acoustic speech signal to cancel the echo signal.
8. The method of claim 1:
- wherein the measuring acoustic noise step, includes the step of,
- measuring background acoustic-noise with a microphone; and
- wherein the reducing step, includes the step of,
- sign, amplitude, and phase adjusting the background acoustic-noise to reduce the acoustic noise.
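The "sign, amplitude, and phase adjusting" of claims 7 and 8 amounts to subtracting a scaled and time-shifted copy of a reference signal. The sketch below assumes the gain (sign and amplitude) and the delay in samples (standing in for phase) are already known; how they are estimated is not shown, and the function name and data are hypothetical.

```python
def cancel_reference(primary, reference, delay, gain):
    """Subtract gain * reference, delayed by `delay` samples, from primary.

    gain carries both sign and amplitude; delay is a whole-sample
    stand-in for the phase adjustment.
    """
    out = list(primary)
    for n in range(delay, len(primary)):
        out[n] -= gain * reference[n - delay]
    return out

# Hypothetical signals: the primary microphone picks up speech plus an
# inverted, doubled, one-sample-late copy of the background noise.
speech = [1.0, 2.0, 3.0, 4.0]
noise_ref = [0.5, -0.5, 0.25, 0.0]
delay, gain = 1, -2.0
primary = [speech[0]] + [
    speech[n] + gain * noise_ref[n - delay] for n in range(1, len(speech))
]

cleaned = cancel_reference(primary, noise_ref, delay, gain)
```

For the echo case of claim 7, the same subtraction would apply with the voiced-speech portion of the acoustic signal serving as the reference.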
9. A method for removing acoustic noise from an acoustic speech signal, comprising the steps of:
- selecting a first set of acoustic speech time frames with timing defined by an excitation function determined using an EM sensor;
- characterizing qualities of an acoustic noise signal over a second set of time frames with timing defined by an excitation function determined using the EM sensor and by using the acoustic speech signal over said second set of time frames;
- constructing an acoustic noise filter appropriate to the acoustic speech signal over the first set of time frames and to the characterized noise signal over the second set of time frames; and
- filtering the acoustic noise signal from the acoustic speech signal over the first set of time frames using the acoustic noise filter, wherein:
- the characterizing step includes the step of characterizing the qualities of the acoustic noise signal over the first set of time frames; and
- the constructing step includes the step of constructing the acoustic noise filter using both acoustic speech signal and noise signal information over the first set of time frames, and
- wherein the constructing step further includes the steps of:
- selecting a first set of acoustic speech time frames corresponding to a set of voiced speech excitation functions;
- constructing a speech band-pass filter using spectral information of the voiced speech excitation function obtained using the EM sensor over the first set of time frames;
- characterizing the acoustic noise over the first set of acoustic speech time frames using the acoustic-signal spectral-information excluded by the speech band-pass filter that is constructed using spectral information of the voiced speech excitation function;
- constructing the acoustic noise filter over the first set of time frames by using the band-pass filter and the characterized acoustic noise; and
- filtering the acoustic noise from the acoustic signal over the first set of time frames using the acoustic noise filter.
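The band-pass idea in claim 9 can be sketched for a single frame: bins where the excitation spectrum is strong define the pass band, and acoustic energy outside those bins is taken as a noise estimate. This is a simplified illustration, not the claimed filter; the naive DFT, the "fraction of peak" rule for choosing the pass band, and all names and values are assumptions.

```python
import cmath
import math

def dft(x):
    """Naive discrete Fourier transform (adequate for short frames)."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * math.pi * m * k / n) for k in range(n))
            for m in range(n)]

def idft(spectrum):
    """Inverse DFT, returning the real part of each sample."""
    n = len(spectrum)
    return [(sum(spectrum[m] * cmath.exp(2j * math.pi * m * k / n)
                 for m in range(n)) / n).real
            for k in range(n)]

def passband_from_excitation(excitation, fraction):
    """Mark bins whose excitation magnitude exceeds fraction * peak."""
    mags = [abs(c) for c in dft(excitation)]
    peak = max(mags)
    return [m > fraction * peak for m in mags]

def filter_frame(acoustic, passband):
    """Return (filtered frame, mean out-of-band magnitude as a noise estimate)."""
    spec = dft(acoustic)
    rejected = [abs(c) for c, keep in zip(spec, passband) if not keep]
    noise_level = sum(rejected) / len(rejected) if rejected else 0.0
    kept = [c if keep else 0.0 for c, keep in zip(spec, passband)]
    return idft(kept), noise_level

# Hypothetical 8-sample frame: excitation at bin 1, acoustic noise at bin 3.
n = 8
excitation = [math.cos(2 * math.pi * k / n) for k in range(n)]
acoustic = [math.cos(2 * math.pi * k / n) + 0.5 * math.cos(2 * math.pi * 3 * k / n)
            for k in range(n)]

passband = passband_from_excitation(excitation, fraction=0.5)
filtered, noise_level = filter_frame(acoustic, passband)
```

A fuller filter would also use the noise estimate to attenuate in-band noise, which this sketch leaves out.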
10. A method for removing acoustic noise from an acoustic speech signal, comprising the steps of:
- selecting a first set of acoustic speech time frames with timing defined by an excitation function determined using an EM sensor;
- characterizing qualities of an acoustic noise signal over a second set of time frames with timing defined by an excitation function determined using the EM sensor and by using the acoustic speech signal over said second set of time frames;
- constructing an acoustic noise filter appropriate to the acoustic speech signal over the first set of time frames and to the characterized noise signal over the second set of time frames; and
- filtering the acoustic noise signal from the acoustic speech signal over the first set of time frames using the acoustic noise filter,
- partitioning the acoustic speech signal into time frames;
- calculating an acoustic speech signal energy;
- calculating an excitation function energy;
- averaging the acoustic speech signal energy over a subset of the time frames;
- averaging the excitation function energy over the subset of the time frames; and
- replacing a portion of the acoustic speech signal in a first time frame with a portion of the acoustic speech signal in a second time frame, if a change in the acoustic speech signal energy in the first time frame exceeds a predetermined threshold, and if the corresponding excitation energy remains constant within predetermined threshold levels.
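The energy test at the end of claim 10 can be sketched as follows: a frame is replaced when its acoustic energy jumps relative to a running average while the excitation energy stays steady. The running-average window, the relative thresholds, and the choice to copy the preceding frame are assumptions for this illustration, as are the function names.

```python
def frame_energy(frame):
    """Sum of squared samples in one frame."""
    return sum(s * s for s in frame)

def mean(values):
    return sum(values) / len(values)

def repair_frames(acoustic_frames, excitation_frames, window, jump, tolerance):
    """Replace frames whose acoustic energy changes by more than jump * average
    while excitation energy stays within tolerance * average.

    Averages are taken over the preceding `window` frames. Returns the
    repaired frames and the indices that were replaced.
    """
    repaired = [list(f) for f in acoustic_frames]
    replaced = []
    for i in range(window, len(repaired)):
        avg_a = mean([frame_energy(f) for f in repaired[i - window:i]])
        avg_e = mean([frame_energy(f) for f in excitation_frames[i - window:i]])
        change_a = abs(frame_energy(repaired[i]) - avg_a)
        change_e = abs(frame_energy(excitation_frames[i]) - avg_e)
        if change_a > jump * avg_a and change_e <= tolerance * avg_e:
            repaired[i] = list(repaired[i - 1])  # borrow the previous frame
            replaced.append(i)
    return repaired, replaced

# Hypothetical frames: steady excitation, one loud acoustic burst in frame 2.
excitation_frames = [[1.0, -1.0] for _ in range(4)]
acoustic_frames = [[0.5, -0.5], [0.5, -0.5], [3.0, -3.0], [0.5, -0.5]]

repaired, replaced = repair_frames(
    acoustic_frames, excitation_frames, window=2, jump=1.0, tolerance=0.1
)
```

The reasoning is that a sudden acoustic change with no matching change at the glottis is unlikely to have come from the speaker.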
11. A system for removing acoustic noise from speech, comprising:
- an EM sensor for generating a speech excitation function from measured movements of a predetermined portion of a vocal tract;
- an acoustic sensor receiving an acoustic speech signal, corresponding to the speech excitation function from the vocal tract; and
- a computer for,
- identifying a first voiced-excitation onset time from the excitation function,
- subtracting a first predetermined unvoiced time period from the first voiced-excitation onset time to obtain a corresponding first unvoiced-acoustic onset time within the acoustic speech signal;
- defining a no-speech time period prior to the first unvoiced-acoustic onset time;
- measuring acoustic noise within the no-speech time period; and
- reducing the acoustic noise in the acoustic speech signal.