Noise Reduction and Audio-Visual Speech Activity Detection
FIELD AND BACKGROUND OF THE INVENTION
The present invention generally relates to the field of noise reduction based on speech activity recognition, in particular to an audio-visual user interface of a telecommunication device running an application that can advantageously be used e.g. for a near-speaker detection algorithm in an environment where a speaker's voice is interfered by a statistically distributed background noise including environmental noise as well as surrounding persons' voices.
Discontinuous transmission of speech signals based on speech/pause detection represents a valid solution to improve the spectral efficiency of new-generation wireless communication systems. In this context, robust voice activity detection algorithms are required, as conventional solutions according to the state of the art exhibit a high misclassification rate in the presence of the background noise typical of mobile environments.
A voice activity detector (VAD) aims to distinguish between a speech signal and several types of acoustic background noise even at low signal-to-noise ratios (SNRs). Therefore, in a typical telephone conversation, such a VAD, together with a comfort noise generator (CNG), is used to achieve silence compression. In the field of multimedia communications, silence compression allows a speech channel to be shared with other types of information, thus guaranteeing simultaneous voice and data applications. In cellular radio systems which are based on the Discontinuous Transmission (DTX) mode, such as GSM, VADs are applied to reduce co-channel interference and the power consumption of the portable equipment. Furthermore, a VAD is vital to reduce the average data bit rate in future generations of digital cellular networks such as the UMTS, which provide for variable bit-rate (VBR) speech coding. Most of the capacity gain is due to the distinction between speech activity and inactivity. The performance of a speech coding approach which is based on phonetic classification, however, strongly depends on the classifier, which must be robust to every type of background noise. As is well known, the performance of a VAD is critical for the overall speech quality, in particular at low SNRs. In case speech frames are detected as noise, intelligibility is seriously impaired owing to speech clipping in the conversation. If, on the other hand, the percentage of noise detected as speech is high, the potential advantages of silence compression are not obtained. In the presence of background noise it may be difficult to distinguish between speech and silence. Hence, more efficient algorithms are needed for voice activity detection in wireless environments.
Although the Fuzzy Voice Activity Detector (FVAD) proposed in "Improved VAD G.729 Annex B for Mobile Communications Using Soft Computing" (Contribution ITU-T, Study Group 16, Question 19/16, Washington, September 2-5, 1997) by F. Beritelli, S. Casale, and A. Cavallaro performs better than other solutions presented in the literature, it exhibits an activity increase, above all in the presence of non-stationary noise. The functional scheme of the FVAD is based on a traditional pattern recognition approach wherein the four differential parameters used for speech activity/inactivity classification are the full-band energy difference, the low-band energy difference, the zero-crossing difference, and the spectral distortion. The matching phase is performed by a set of fuzzy rules obtained automatically by means of a new hybrid learning tool as described in "FuGeNeSys: Fuzzy Genetic Neural System for Fuzzy Modeling" by M. Russo (to appear in IEEE Transactions on Fuzzy Systems). As is well known, a fuzzy system allows a gradual, continuous transition rather than a sharp change between two values. Thus, the fuzzy VAD returns a continuous output signal ranging from 0 (non-activity) to 1 (activity), which does not depend on whether single input signals have exceeded a predefined threshold or not, but on an overall evaluation of the values they have assumed ("defuzzification process"). The final decision is made by comparing the output of the fuzzy system, which varies in a range between 0 and 1, with a fixed threshold experimentally chosen as described in "Voice Control of the Pan-European Digital Mobile Radio System" (ICC '89, pp. 1070-1074) by C. B. Southcott et al.
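The continuous-output principle of such a fuzzy VAD can be illustrated by a minimal sketch that maps the four differential parameters to a soft activity score. Note that the actual FVAD learns its fuzzy rule set automatically with FuGeNeSys; the sigmoid memberships, weights, and offsets below are purely hypothetical stand-ins for those rules:

```python
import numpy as np

def soft_vad_score(full_band_diff, low_band_diff, zc_diff, spectral_dist):
    """Toy soft VAD: map the four differential parameters to a continuous
    activity score in [0, 1] via sigmoid memberships. Illustrative only;
    the real FVAD derives its rules automatically, not by hand."""
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    memberships = [
        sigmoid(2.0 * full_band_diff),   # energy rise hints at speech
        sigmoid(2.0 * low_band_diff),
        sigmoid(-1.5 * zc_diff),         # high zero-crossing rate hints at noise
        sigmoid(1.0 * spectral_dist),
    ]
    return float(np.mean(memberships))   # continuous: 0 = pause, 1 = speech

def hard_decision(score, threshold=0.5):
    """Final decision: compare the continuous output with a fixed threshold."""
    return score > threshold
```

The score varies smoothly with the inputs, so the final decision reflects an overall evaluation of all four parameters rather than any single threshold crossing.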
Just as voice activity detectors, conventional automatic speech recognition (ASR) systems also experience difficulties when being operated in noisy environments, since the accuracy of conventional ASR algorithms largely decreases in noisy environments. When a speaker is talking in a noisy environment including both ambient noise and surrounding persons' interfering voices, a microphone picks up not only the speaker's voice but also these background sounds. Consequently, an audio signal which encompasses the speaker's voice superimposed by said background sounds is processed. The louder the interfering sounds, the more the acoustic comprehensibility of the speaker is reduced. To overcome this problem, noise reduction circuitries are applied that make use of the different frequency regions of environmental noise and the respective speaker's voice.
A typical noise reduction circuitry for a telephony-based application based on a speech activity estimation algorithm according to the state of the art, which implements a method for correlating the discrete signal spectrum S(k·Δf) of an analog-to-digital-converted audio signal s(t) with an audio speech activity estimate, is shown in Fig. 2a. Said audio speech activity estimate is obtained by an amplitude detection of the digital audio signal s(nT). The circuit outputs a noise-reduced audio signal ŝᵢ(nT), which is calculated by subjecting the difference of the discrete signal spectrum S(k·Δf) and a sampled version Φnn(k·Δf) of the estimated noise power density spectrum Φnn(f) of a statistically distributed background noise n'(t) to an Inverse Fast Fourier Transform (IFFT).
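The conventional pipeline of Fig. 2a can be sketched as follows, assuming frame-wise processing, a simple running average for the noise power estimate, and an illustrative amplitude threshold; none of these parameters are specified in the source:

```python
import numpy as np

def spectral_subtraction(s, frame_len=256, vad_threshold=0.01):
    """Sketch of the Fig. 2a pipeline: an amplitude-based speech activity
    estimate gates the noise PSD estimate, which is then subtracted from
    the signal spectrum before the IFFT. Parameters are illustrative."""
    out = np.zeros_like(s)
    noise_psd = np.zeros(frame_len)
    n_noise = 0
    for start in range(0, len(s) - frame_len + 1, frame_len):
        frame = s[start:start + frame_len]
        S = np.fft.fft(frame)                       # discrete spectrum S(k*Δf)
        if np.mean(np.abs(frame)) < vad_threshold:  # amplitude detector: "pause"
            n_noise += 1
            noise_psd += (np.abs(S) ** 2 - noise_psd) / n_noise  # running mean
        # subtract the estimated noise power, floored at zero
        clean_power = np.maximum(np.abs(S) ** 2 - noise_psd, 0.0)
        S_clean = np.sqrt(clean_power) * np.exp(1j * np.angle(S))
        out[start:start + frame_len] = np.fft.ifft(S_clean).real
    return out
```

The amplitude detector gates the noise estimate: frames classified as pauses update the noise power density estimate, which is then subtracted from every frame's spectrum before the inverse transform.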
BRIEF DESCRIPTION OF THE STATE OF THE ART
The invention described in US 5,313,522 refers to a device for facilitating comprehension by a hearing-impaired person participating in a telephone conversation, which comprises a circuitry for converting received audio speech signals into a series of phonemes and an arrangement for coupling the circuitry to a POTS line. The circuit thereby includes an arrangement which correlates the detected series of phonemes with recorded lip movements of a speaker and displays these lip movements in subsequent images on a display device, thereby permitting the hearing-impaired person to carry out a lipreading procedure while listening to the telephone conversation, which improves the person's level of comprehension.
The invention disclosed in WO 99/52097 pertains to a communication device and a method for sensing the movements of a speaker's lips, generating an audio signal corresponding to detected lip movements of said speaker and transmitting said audio signal, thereby sensing a level of ambient noise and accordingly controlling the power level of the audio signal to be transmitted.
OBJECT OF THE UNDERLYING INVENTION
In view of the state of the art mentioned above, it is the object of the present invention to enhance the speech/pause detection accuracy of a telephony-based voice activity detection (VAD) system. In particular, it is the object of the invention to increase the signal-to-interference ratio (SIR) of a recorded speech signal in crowded environments where a speaker's voice is severely interfered by ambient noise and/or surrounding persons' voices.
The aforementioned object is achieved by means of the features in the independent claims. Advantageous features are defined in the subordinate claims.
SUMMARY OF THE INVENTION
The present invention is dedicated to a noise reduction and automatic speech activity recognition system having an audio-visual user interface, wherein said system is adapted for running an application for combining a visual feature vector ov,nT that comprises features extracted from a digital video sequence v(nT) showing a speaker's face by detecting and analyzing e.g. lip movements and/or facial expressions of said speaker Sᵢ with an audio feature vector oa,nT which comprises features extracted from a recorded analog audio sequence s(t). Said audio sequence s(t) thereby represents the voice of said speaker Sᵢ interfered by a statistically distributed background noise
n'(t) = n(t) + Int(t),    (1)

which includes both environmental noise n(t) and a weighted sum of the surrounding persons' interfering voices

Int(t) = Σⱼ αⱼ · sⱼ(t − Tⱼ)    (j = 1, …, N; j ≠ i)    (2a)

in the environment of said speaker Sᵢ. Thereby, N denotes the total number of speakers (inclusive of said speaker Sᵢ), αⱼ is the attenuation factor for the interference signal sⱼ(t) of the j-th speaker Sⱼ in the environment of the speaker Sᵢ, Tⱼ is the delay of sⱼ(t), and RjM denotes the distance between the j-th speaker Sⱼ and a microphone recording the audio signal s(t). By tracking the lip movement of a speaker, visual features are extracted which can then be analyzed and used for further processing. For this reason, the bimodal perceptual user interface comprises a video camera pointing to the speaker's face for recording a digital video sequence v(nT) showing lip movements and/or facial expressions of said speaker Sᵢ, audio feature extraction and analyzing means for determining acoustic-phonetic speech characteristics of the speaker's voice and pronunciation based on the recorded audio sequence s(t), and visual feature extraction and analyzing means for continuously or intermittently determining the current location of the speaker's face, tracking lip movements and/or facial expressions of the speaker in subsequent images and determining acoustic-phonetic speech characteristics of the speaker's voice and pronunciation based on the detected lip movements and/or facial expressions.
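The noise model of Eqs. (1) and (2a) can be reproduced numerically. In this sketch the delays Tⱼ are given in samples, and the distance attenuation associated with RjM is assumed to be folded into the factors αⱼ:

```python
import numpy as np

def interfered_signal(speaker, interferers, attenuations, delays, ambient):
    """Build s(t) = s_i(t) + n'(t) with n'(t) = n(t) + sum_j alpha_j * s_j(t - T_j),
    per Eqs. (1) and (2a). Delays are in samples; the 1/R_jM distance
    attenuation is assumed folded into alpha_j."""
    total = speaker + ambient                       # s_i(t) + n(t)
    for s_j, alpha_j, T_j in zip(interferers, attenuations, delays):
        # delay the j-th interfering voice by T_j samples
        delayed = np.concatenate([np.zeros(T_j), s_j])[:len(speaker)]
        total += alpha_j * delayed                  # add weighted interferer
    return total
```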
According to the invention, the aforementioned extracted and analyzed visual features are fed to a noise reduction circuit that is needed to increase the signal-to-interference ratio (SIR) of the recorded audio signal s(t). Said noise reduction circuit is specially adapted to perform a near-speaker detection by separating the speaker's voice from said background noise n'(t) based on the derived acoustic-phonetic speech characteristics.
It outputs a speech activity indication signal ŝᵢ(nT) which is obtained by a combination of speech activity estimates supplied by said audio feature extraction and analyzing means as well as said visual feature extraction and analyzing means.
BRIEF DESCRIPTION OF THE DRAWINGS
Advantageous features, aspects, and useful embodiments of the invention will become evident from the following description, the appended claims, and the accompanying drawings. Thereby,
Fig. 1 shows a noise reduction and speech activity recognition system having an audio-visual user interface, said system being specially adapted for running a real-time lip tracking application which combines visual features ov,nT extracted from a digital video sequence v(nT) showing the face of a speaker Sᵢ by detecting and analyzing the speaker's lip movements and/or facial expressions with audio features oa,nT extracted from an analog audio sequence s(t) representing the voice of said speaker Sᵢ interfered by a statistically distributed background noise n'(t),
Fig. 2a is a block diagram showing a conventional noise reduction and speech activity recognition system for a telephony-based application based on an audio speech activity estimation according to the state of the art,
Fig. 2b shows an example of a camera-enhanced noise reduction and speech activity recognition system for a telephony-based application that implements an audio-visual speech activity estimation algorithm according to one embodiment of the present invention,

Fig. 2c shows an example of a camera-enhanced noise reduction and speech activity recognition system for a telephony-based application that implements an audio-visual speech activity estimation algorithm according to a further embodiment of the present invention,
Fig. 3a shows a flow chart illustrating a near-end speaker detection method reducing the noise level of a detected analog audio sequence s(t) according to the embodiment depicted in Fig. 1 of the present invention,
Fig. 3b shows a flow chart illustrating a near-end speaker detection method according to the embodiment depicted in Fig. 2b of the present invention, and
Fig. 3c shows a flow chart illustrating a near-end speaker detection method according to the embodiment depicted in Fig. 2c of the present invention.
DETAILED DESCRIPTION OF THE UNDERLYING INVENTION
In the following, different embodiments of the present invention as depicted in Figs. 1, 2b, 2c, and 3a-c shall be explained in detail. The meaning of the symbols designated with reference numerals and signs in Figs. 1 to 3c can be taken from an annexed table.
According to a first embodiment of the invention as depicted in Fig. 1, said noise reduction and speech activity recognition system 100 comprises a noise reduction circuit 106 which is specially adapted to reduce the background noise n'(t) received by a microphone 101a and to perform a near-speaker detection by separating the speaker's voice from said background noise n'(t), as well as a multi-channel acoustic echo cancellation unit 108 being specially adapted to perform a near-end speaker detection and/or double-talk detection algorithm based on acoustic-phonetic speech characteristics derived with the aid of the aforementioned audio and visual feature extraction and analyzing means 104a+b and 106b, respectively. Thereby, said acoustic-phonetic speech characteristics are based on the opening of a speaker's mouth as an estimate of the acoustic energy of articulated vowels or diphthongs, respectively, rapid movement of the speaker's lips as a hint to labial or labio-dental consonants (e.g. plosive, fricative or affricate phonemes, voiced or unvoiced, respectively), and other statistically detected phonetic characteristics of an association between position and movement of the lips and the voice and pronunciation of a speaker Sᵢ.
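The mapping from the two visual cues named above (mouth opening as a vowel-energy estimate, rapid lip movement as a consonant hint) to a per-frame speech activity estimate might be sketched as follows. The thresholds and the max-combination are illustrative assumptions, not the statistically learned association of the actual system:

```python
def visual_speech_activity(mouth_openings, lip_speeds,
                           open_thresh=0.2, speed_thresh=0.5):
    """Toy per-frame visual speech activity estimate: mouth opening hints
    at vowel energy, rapid lip movement at labial/labio-dental consonants.
    Thresholds and the max-combination are hypothetical."""
    estimates = []
    for opening, speed in zip(mouth_openings, lip_speeds):
        vowel_cue = min(opening / open_thresh, 1.0)      # saturates at 1
        consonant_cue = min(speed / speed_thresh, 1.0)   # saturates at 1
        estimates.append(max(vowel_cue, consonant_cue))  # either cue fires
    return estimates
```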
The aforementioned noise reduction circuit 106 comprises digital signal processing means 106a for calculating a discrete signal spectrum S(k·Δf) that corresponds to an analog-to-digital-converted version s(nT) of the recorded audio sequence s(t) by performing a Fast Fourier Transform (FFT), audio feature extraction and analyzing means 106b (e.g. an amplitude detector) for detecting acoustic-phonetic speech characteristics of a speaker's voice and pronunciation based on the recorded audio sequence s(t), means 106c for estimating the noise power density spectrum Φnn(f) of the statistically distributed background noise n'(t) based on the result of the speaker detection procedure performed by said audio feature extraction and analyzing means 106b, a subtracting element 106d for subtracting a discretized version Φnn(k·Δf) of the estimated noise power density spectrum Φnn(f) from the discrete signal spectrum S(k·Δf) of the analog-to-digital-converted audio sequence s(nT), and digital signal processing means 106e for calculating the corresponding discrete time-domain signal ŝᵢ(nT) of the obtained difference signal by performing an Inverse Fast Fourier Transform (IFFT).
The depicted noise reduction and speech activity recognition system 100 comprises audio feature extraction and analyzing means 106b which are used for determining acoustic-phonetic speech characteristics of the speaker's voice and pronunciation (oa,nT) based on the recorded audio sequence s(t), and visual feature extraction and analyzing means 104a+b for determining the current location of the speaker's face at a data rate of 1 frame/s, tracking lip movements and/or facial expressions of said speaker Sᵢ at a data rate of 15 frames/s, and determining acoustic-phonetic speech characteristics of the speaker's voice and pronunciation based on detected lip movements and/or facial expressions (ov,nT).
As depicted in Fig. 1, said noise reduction system 200b/c can advantageously be used for a video-telephony-based application in a telecommunication system running on a video-enabled phone 102 which is equipped with a built-in video camera 101b' pointing at the face of a speaker Sᵢ participating in a video telephony session.
Fig. 2b shows an example of a slow camera-enhanced noise reduction and speech activity recognition system 200b for a telephony-based application which implements an audio-visual speech activity estimation algorithm according to one embodiment of the present invention. Thereby, an audio speech activity estimate taken from an audio feature vector oa,nT supplied by said audio feature extraction and analyzing means 106b is correlated with a further speech activity estimate that is obtained by calculating the difference of the discrete signal spectrum S(k·Δf) and a sampled version Φnn(k·Δf) of the estimated noise power density spectrum Φnn(f) of the statistically distributed background noise n'(t). Said audio speech activity estimate is obtained by an amplitude detection of the band-pass-filtered discrete signal spectrum S(k·Δf) of the analog-to-digital-converted audio signal s(t).
Similar to the embodiment depicted in Fig. 1, the noise reduction and speech activity recognition system 200b depicted in Fig. 2b comprises an audio feature extraction and analyzing means 106b (e.g. an amplitude detector) which is used for determining acoustic-phonetic speech characteristics of the speaker's voice and pronunciation (oa,nT) based on the recorded audio sequence s(t), and visual feature extraction and analyzing means 104' and 104" for determining the current location of the speaker's face at a data rate of 1 frame/s, tracking lip movements and facial expressions of said speaker Sᵢ at a data rate of 15 frames/s, and determining acoustic-phonetic speech characteristics of the speaker's voice and pronunciation based on detected lip movements and/or facial expressions (ov,nT). Thereby, said audio feature extraction and analyzing means 106b can simply be realized as an amplitude detector.
Aside from the components 106a-e described above with reference to Fig. 1, the noise reduction circuit 106 depicted in Fig. 2b comprises a delay element 204, which provides a delayed version of the discrete signal spectrum S(k·Δf) of the analog-to-digital-converted audio signal s(t), and a first multiplier element 107a, which is used for correlating (S9) the discrete signal spectrum Sτ(k·Δf) of a delayed version s(nT−τ) of the analog-to-digital-converted audio signal s(nT) with a visual speech activity estimate taken from a visual feature vector ov,t supplied by the visual feature extraction and analyzing means 104a+b and/or 104'+104", thus yielding a further estimate Ŝᵢ'(f) for updating the estimate Ŝᵢ(f) for the frequency spectrum Sᵢ(f) corresponding to the signal sᵢ(t) that represents said speaker's voice as well as a further estimate Φnn'(f) for updating the estimate Φnn(f) for the noise power density spectrum of the statistically distributed background noise n'(t). It further comprises a second multiplier element 107, which is used for correlating (S8a) the discrete signal spectrum Sτ(k·Δf) of a delayed version s(nT−τ) of the analog-to-digital-converted audio signal s(nT) with an audio speech activity estimate obtained by an amplitude detection (S8b) of the band-pass-filtered discrete signal spectrum S(k·Δf), thus yielding an estimate Ŝᵢ(f) for the frequency spectrum Sᵢ(f) which corresponds to the signal sᵢ(t) that represents said speaker's voice and an estimate Φnn(f) for the noise power density spectrum of said background noise n'(t). A sample-and-hold (S&H) element 106d' provides a sampled version Φnn(k·Δf) of the estimated noise power density spectrum Φnn(f). The noise reduction circuit 106 further comprises a band-pass filter with adjustable cut-off frequencies, which is used for filtering the discrete signal spectrum S(k·Δf) of the analog-to-digital-converted audio signal s(t). The cut-off frequencies can be adjusted dependent on the bandwidth of the estimated speech signal spectrum Ŝᵢ(f). A switch 106f is provided for selectively switching between a first and a second mode for receiving said speech signal sᵢ(t) with and without using the proposed audio-visual speech recognition approach providing a noise-reduced speech signal ŝᵢ(t), respectively. According to a further aspect of the present invention, means are provided for switching said microphone 101a off when the actual level of the speech activity indication signal ŝᵢ(nT) falls below a predefined threshold value (not shown).
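One way to read the two correlation stages (S8a and S9) is as complementary weightings of the delayed spectrum: an activity estimate near 1 attributes the spectrum to speech, and an estimate near 0 to noise. The averaging used below to merge the audio-derived and visual-derived estimates is an assumption; the source only states that the visual estimate updates both estimates:

```python
import numpy as np

def slow_av_update(S_tau, audio_activity, visual_activity):
    """Sketch of the Fig. 2b correlation stages: the delayed spectrum
    S_tau(k*Δf) is weighted by the audio activity estimate (S8a) to form
    the speech-spectrum estimate, with the complement feeding the noise
    PSD estimate; the visual estimate (S9) then refines both. The equal
    averaging of the two refinements is an assumption."""
    S_i = audio_activity * S_tau                          # speech spectrum estimate
    Phi_nn = (1.0 - audio_activity) * np.abs(S_tau) ** 2  # noise PSD estimate
    # visual refinement (S9): further estimates used to update the above
    S_i_vis = visual_activity * S_tau
    Phi_nn_vis = (1.0 - visual_activity) * np.abs(S_tau) ** 2
    S_i = 0.5 * (S_i + S_i_vis)
    Phi_nn = 0.5 * (Phi_nn + Phi_nn_vis)
    return S_i, Phi_nn
```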
An example of a fast camera-enhanced noise reduction and speech activity recognition system 200c for a telephony-based application which implements an audio-visual speech activity estimation algorithm according to a further embodiment of the present invention is depicted in Fig. 2c. The circuitry correlates a discrete signal spectrum S(k·Δf) of the analog-to-digital-converted audio signal s(t) with a delayed version of an audio-visual speech activity estimate and a further speech activity estimate obtained by calculating the difference spectrum of the discrete signal spectrum S(k·Δf) and a sampled version Φnn(k·Δf) of the estimated noise power density spectrum Φnn(f). The aforementioned audio-visual speech activity estimate is taken from an audio-visual feature vector oav,t obtained by combining an audio feature vector oa,t supplied by said audio feature extraction and analyzing means 106b with a visual feature vector ov,t supplied by said visual speech activity detection module 104".
Aside from the components described above with reference to Fig. 1, the noise reduction circuit 106 depicted in Fig. 2c comprises a summation element 107c, which is used for adding (S11a) an audio speech activity estimate supplied from an audio feature extraction and analyzing means 106b (e.g. an amplitude detector) for determining acoustic-phonetic speech characteristics of the speaker's voice and pronunciation (oa,nT) based on the recorded audio sequence s(t) to a visual speech activity estimate supplied from visual feature extraction and analyzing means 104' and 104" for determining the current location of the speaker's face at a data rate of 1 frame/s, tracking lip movements and facial expressions of said speaker Sᵢ at a data rate of 15 frames/s, and determining acoustic-phonetic speech characteristics of the speaker's voice and pronunciation based on detected lip movements and/or facial expressions (ov,nT), thus yielding an audio-visual speech activity estimate. The noise reduction circuit 106 further comprises a multiplier element 107', which is used for correlating (S11b) the discrete signal spectrum S(k·Δf) of the analog-to-digital-converted audio signal s(t) with an audio-visual speech activity estimate, obtained by combining an audio feature vector oa,t supplied by said audio feature extraction and analyzing means 106b with a visual feature vector ov,t supplied by said visual speech activity detection module 104", thereby yielding an estimate Ŝᵢ(f) for the frequency spectrum Sᵢ(f) which corresponds to the signal sᵢ(t) that represents the speaker's voice and an estimate Φnn(f) for the noise power density spectrum of the statistically distributed background noise n'(t). A sample-and-hold (S&H) element 106d' provides a sampled version Φnn(k·Δf) of the estimated noise power density spectrum Φnn(f).
The noise reduction circuit 106 further comprises a band-pass filter with adjustable cut-off frequencies, which is used for filtering the discrete signal spectrum S(k·Δf) of the analog-to-digital-converted audio signal s(t). Said cut-off frequencies can be adjusted dependent on the bandwidth of the estimated speech signal spectrum Ŝᵢ(f). A switch 106f is provided for selectively switching between a first and a second mode for receiving said speech signal sᵢ(t) with and without using the proposed audio-visual speech recognition approach providing a noise-reduced speech signal ŝᵢ(t), respectively. According to a further aspect of the present invention, said noise reduction system 200c comprises means (SW) for switching said microphone 101a off when the actual level of the speech activity indication signal ŝᵢ(nT) falls below a predefined threshold value (not shown).
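The fast variant's combination stage (S11a followed by S11b) can be sketched as follows; clipping the summed estimate to [0, 1] and the complementary noise weighting are assumptions not stated in the source:

```python
import numpy as np

def fast_av_estimate(S, audio_estimate, visual_estimate):
    """Sketch of the Fig. 2c stages: the audio and visual activity estimates
    are summed (S11a) into an audio-visual estimate, which is then correlated
    (S11b) with the spectrum S(k*Δf) to yield the speech-spectrum and noise
    PSD estimates. Clipping to [0, 1] is an assumption."""
    av = min(audio_estimate + visual_estimate, 1.0)  # S11a: combined estimate
    S_i = av * S                                     # S11b: speech spectrum estimate
    Phi_nn = (1.0 - av) * np.abs(S) ** 2             # complementary noise PSD
    return av, S_i, Phi_nn
```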
A still further embodiment of the present invention is directed to a near-end speaker detection method as shown in the flow chart depicted in Fig. 3a. Said method reduces the noise level of a recorded analog audio sequence s(t) being interfered by a statistically distributed background noise n'(t), said audio sequence representing the voice of a speaker Sᵢ. After having subjected (S1) the analog audio sequence s(t) to an analog-to-digital conversion, the corresponding discrete signal spectrum S(k·Δf) of the analog-to-digital-converted audio sequence s(nT) is calculated (S2) by performing a Fast Fourier Transform (FFT), and the voice of said speaker Sᵢ is detected (S3) from said signal spectrum S(k·Δf) by analyzing visual features extracted from a video sequence v(nT) recorded simultaneously with the analog audio sequence s(t), which tracks the current location of the speaker's face, lip movements and/or facial expressions of the speaker Sᵢ in subsequent images. Next, the noise power density spectrum Φnn(f) of the statistically distributed background noise n'(t) is estimated (S4) based on the result of the speaker detection step (S3), whereupon a sampled version Φnn(k·Δf) of the estimated noise power density spectrum Φnn(f) is subtracted (S5) from the discrete spectrum S(k·Δf) of the analog-to-digital-converted audio sequence s(nT). Finally, the corresponding discrete time-domain signal ŝᵢ(nT) of the obtained difference signal, which represents a discrete version of the recognized speech signal, is calculated (S6) by performing an Inverse Fast Fourier Transform (IFFT).
Optionally, a multi-channel acoustic echo cancellation algorithm which models echo path impulse responses by means of adaptive finite impulse response (FIR) filters and subtracts echo signals from the analog audio sequence s(t) can be conducted (S7) based on acoustic-phonetic speech characteristics derived by an algorithm for extracting visual features from a video sequence tracking the location of a speaker's face, lip movements and/or facial expressions of the speaker Sᵢ in subsequent images. Said multi-channel acoustic echo cancellation algorithm thereby performs a double-talk detection procedure.
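A single-channel sketch of such an adaptive-FIR echo canceller, using the common normalized LMS (NLMS) update rule, is given below; the multi-channel structure and the visually assisted double-talk detection of step S7 are omitted here, and the filter length and step size are illustrative:

```python
import numpy as np

def nlms_echo_cancel(far_end, mic, n_taps=32, mu=0.5, eps=1e-8):
    """Adaptive-FIR echo cancellation sketch: an NLMS-adapted filter models
    the echo path impulse response, and its output is subtracted from the
    microphone signal, leaving the residual (near-end) signal."""
    w = np.zeros(n_taps)             # adaptive FIR estimate of the echo path
    buf = np.zeros(n_taps)           # recent far-end samples, newest first
    out = np.zeros(len(mic))
    for n in range(len(mic)):
        buf = np.roll(buf, 1)
        buf[0] = far_end[n]
        echo_hat = w @ buf                        # estimated echo
        e = mic[n] - echo_hat                     # residual after subtraction
        out[n] = e
        w += mu * e * buf / (buf @ buf + eps)     # NLMS coefficient update
    return out
```

With a stationary echo path and no near-end speech, the residual decays toward zero as the filter converges; in practice the update would be frozen during double-talk, which is where the visually derived speech activity estimate assists.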
According to a further aspect of the invention, a learning procedure is applied which enhances the step of detecting (S3) the voice of said speaker Sᵢ from the discrete signal spectrum S(k·Δf) of the analog-to-digital-converted version s(nT) of an analog audio sequence s(t) by analyzing visual features extracted from a video sequence recorded simultaneously with the analog audio sequence s(t), which tracks the current location of the speaker's face, lip movements and/or facial expressions of the speaker Sᵢ in subsequent images.
In one embodiment of the present invention, which is illustrated in the flow charts depicted in Figs. 3a+b, a near-end speaker detection method is proposed that is characterized by the step of correlating (S8a) the discrete signal spectrum Sτ(k·Δf) of a delayed version s(nT−τ) of the analog-to-digital-converted audio signal s(nT) with an audio speech activity estimate obtained by an amplitude detection (S8b) of the band-pass-filtered discrete signal spectrum S(k·Δf), thereby yielding an estimate Ŝᵢ(f) for the frequency spectrum Sᵢ(f) which corresponds to the signal sᵢ(t) representing said speaker's voice and an estimate Φnn(f) for the noise power density spectrum of said background noise n'(t). Moreover, the discrete signal spectrum Sτ(k·Δf) of a delayed version s(nT−τ) of the analog-to-digital-converted audio signal s(nT) is correlated (S9) with a visual speech activity estimate taken from a visual feature vector ov,t which is supplied by the visual feature extraction and analyzing means 104a+b and/or 104'+104", thus yielding a further estimate Ŝᵢ'(f) for updating the estimate Ŝᵢ(f) for the frequency spectrum Sᵢ(f) which corresponds to the signal sᵢ(t) representing the speaker's voice as well as a further estimate Φnn'(f) that is used for updating the estimate Φnn(f) for the noise power density spectrum of the statistically distributed background noise n'(t). The noise reduction circuit 106 thereby provides a band-pass filter 204 for filtering the discrete signal spectrum S(k·Δf) of the analog-to-digital-converted audio signal s(t), wherein the cut-off frequencies of said band-pass filter 204 are adjusted (S10) dependent on the bandwidth of the estimated speech signal spectrum Ŝᵢ(f).
In a further embodiment of the present invention, as shown in the flow charts depicted in Figs. 3a+c, a near-end speaker detection method is proposed which is characterized by the step of adding (S11a) an audio speech activity estimate obtained by an amplitude detection of the band-pass-filtered discrete signal spectrum S(k·Δf) of the analog-to-digital-converted audio signal s(t) to a visual speech activity estimate taken from a visual feature vector ov,t supplied by said visual feature extraction and analyzing means 104a+b and/or 104'+104", thereby yielding an audio-visual speech activity estimate. According to this embodiment, the discrete signal spectrum S(k·Δf) is correlated (S11b) with the audio-visual speech activity estimate, thus yielding an estimate Ŝᵢ(f) for the frequency spectrum Sᵢ(f) corresponding to the signal sᵢ(t) that represents said speaker's voice as well as an estimate Φnn(f) for the noise power density spectrum of the statistically distributed background noise n'(t). The cut-off frequencies of the band-pass filter 204 that is used for filtering the discrete signal spectrum S(k·Δf) of the analog-to-digital-converted audio signal s(t) are adjusted (S11c) dependent on the bandwidth of the estimated speech signal spectrum Ŝᵢ(f).
Finally, the present invention also pertains to the use of a noise reduction system 200b/c and a corresponding near-end speaker detection method as described above for a video-telephony-based application (e.g. a video conference) in a telecommunication system running on a video-enabled phone having a built-in video camera 101b' pointing at the face of a speaker Sᵢ participating in a video telephony session. This especially pertains to a scenario where a number of persons are sitting in one room equipped with many cameras and microphones such that a speaker's voice interferes with the voices of the other persons.
Table: Depicted Features and Their Corresponding Reference Signs
No. Technical Feature (System Component or Procedure Step)
100 noise reduction and speech activity recognition system having an audio-visual user interface, said system being specially adapted for running a real-time lip tracking application which combines visual features ov,nT extracted from a digital video sequence v(nT) showing the face of a speaker Sᵢ by detecting and analyzing the speaker's lip movements and/or facial expressions with audio features oa,nT extracted from an analog audio sequence s(t) representing the voice of said speaker Sᵢ interfered by a statistically distributed background noise n'(t), wherein said audio sequence s(t) includes - aside from the signal representing the voice of said speaker Sᵢ - both environmental noise n(t) and a weighted sum Σⱼ αⱼ·sⱼ(t−Tⱼ) (for j ≠ i) of surrounding persons' interfering voices in the environment of said speaker Sᵢ
101a microphone, used for recording an analog audio sequence s(t) representing the voice of a speaker Si interfered by a statistically distributed background noise n'(t), which includes both environmental noise n(t) and a weighted sum Σj aj·sj(t−Tj) (with j ≠ i) of surrounding persons' interfering voices in the environment of said speaker Si
101a' analog-to-digital converter (ADC), used for converting the analog audio sequence s(t) recorded by said microphone 101a into the digital domain
101b video camera pointing to the speaker's face for recording a video sequence showing lip movements and/or facial expressions of said speaker Si
101b' video camera as described above with an integrated analog-to-digital converter (ADC)
102 video telephony application, used for transmitting a video sequence showing a speaker's face and lip movements in subsequent images
104 visual front end of an automatic audio-visual speech recognition system 100 using a bimodal approach to speech recognition and near-speaker detection by incorporating a real-time lip tracking algorithm for deriving additional visual features from lip movements and/or facial expressions of a speaker Si whose voice is interfered by a statistically distributed background noise n'(t), the visual front end 104 comprising visual feature extraction and analyzing means for continuously or intermittently determining the current location of the speaker's face, tracking lip movements and/or facial expressions of the speaker Si in subsequent images and determining acoustic-phonetic speech characteristics of the speaker's voice and pronunciation based on detected lip movements and/or facial expressions
104' visual feature extraction module for continuously tracking lip movements and/or facial expressions of the speaker Si and determining acoustic-phonetic speech characteristics of the speaker's voice based on detected lip movements and/or facial expressions
104" visual speech activity detection module for analyzing the acoustic-phonetic speech characteristics and detecting speech activity of a speaker based on said analysis
104a visual feature extraction means for continuously or intermittently determining the current location of the speaker's face recorded by a video camera 101b at a rate of 1 frame/s
104b visual feature extraction and analyzing means for continuously tracking lip movements and/or facial expressions of the speaker Si and determining acoustic-phonetic speech characteristics of said speaker's voice based on detected lip movements and/or facial expressions at a rate of 15 frames/s
106 noise reduction circuit, specially adapted to reduce statistically distributed background noise n'(t) received by said microphone 101a and to perform a near-speaker detection by separating the speaker's voice from said background noise n'(t) based on a combination of the speech characteristics which are derived by said visual and audio feature extraction and analyzing means 104a+b and 106b, respectively
106a digital signal processing means for calculating the discrete signal spectrum S(k·Δf) that corresponds to the analog-to-digital-converted version s(nT) of the recorded audio sequence s(t) by performing a Fast Fourier Transform (FFT)
106b audio feature extraction and analyzing means (e.g. an amplitude detector) for detecting acoustic-phonetic speech characteristics of the speaker's voice and pronunciation based on the recorded audio sequence s(t)
106c means for estimating the noise power density spectrum Φnn(f) of the statistically distributed background noise n'(t)
107a multiplier element, used for correlating (S9) the discrete signal spectrum Sτ(k·Δf) of a delayed version s(nT−τ) of the analog-to-digital-converted audio signal s(nT) with a visual speech activity estimate taken from a visual feature vector ov,t supplied by the visual feature extraction and analyzing means 104a+b and/or 104'+104", thereby yielding a further estimate Ŝi′(f) for updating the estimate Ŝi(f) for the frequency spectrum Si(f) corresponding to the signal si(t) which represents said speaker's voice as well as a further estimate Φ̂nn′(f) for updating the estimate Φ̂nn(f) for the noise power density spectrum Φnn(f) of the statistically distributed background noise n'(t)
107b multiplier element, used for correlating (S8a) the discrete signal spectrum Sτ(k·Δf) of a delayed version s(nT−τ) of the analog-to-digital-converted audio signal s(nT) with an audio speech activity estimate obtained by an amplitude detection (S8b) of the band-pass-filtered discrete signal spectrum S(k·Δf), thereby yielding an estimate Ŝi(f) for the frequency spectrum Si(f) corresponding to the signal si(t) which represents said speaker's voice as well as an estimate Φ̂nn(f) for the noise power density spectrum Φnn(f) of the statistically distributed background noise n'(t)
107c summation element, used for adding (S11a) the audio speech activity estimate to the visual speech activity estimate, thereby yielding an audio-visual speech activity estimate
108 multi-channel acoustic echo cancellation unit, specially adapted to perform a near-end speech detection and/or double-talk detection algorithm based on acoustic-phonetic speech characteristics derived by said visual and audio feature extraction and analyzing means 104a+b and 106b, respectively
108a means for near-end talk and/or double-talk detection, integrated in the multi-channel acoustic echo cancellation unit 108
200a block diagram showing a conventional noise reduction and speech activity recognition system for a telephony-based application based on an audio speech activity estimation according to the state of the art, wherein the discrete signal spectrum S(k·Δf) of the analog-to-digital-converted audio signal s(t) is correlated with an audio speech activity estimate which is obtained by an amplitude detection of the digital audio signal s(nT)
200b block diagram showing an example of a slow camera-enhanced noise reduction and speech activity recognition system for a telephony-based application implementing an audio-visual speech activity estimation algorithm according to one embodiment of the present invention, wherein the discrete signal spectrum Sτ(k·Δf) of a delayed version s(nT−τ) of the analog-to-digital-converted audio signal s(nT) is correlated (S8a) with an audio speech activity estimate obtained by an amplitude detection (S8b) of the band-pass-filtered discrete signal spectrum S(k·Δf), thereby yielding an estimate Ŝi(f) for the frequency spectrum Si(f) corresponding to the signal si(t) which represents said speaker's voice and an estimate Φ̂nn(f) for the noise power density spectrum Φnn(f) of the statistically distributed background noise n'(t), and also correlated (S9) with a visual speech activity estimate taken from a visual feature vector ov,t supplied by the visual feature extraction and analyzing means 104a+b and/or 104'+104", thereby yielding a further estimate Ŝi′(f) for updating the estimate Ŝi(f) for the frequency spectrum Si(f) corresponding to the signal si(t) which represents said speaker's voice as well as a further estimate Φ̂nn′(f) for updating the estimate Φ̂nn(f) for the noise power density spectrum Φnn(f) of the statistically distributed background noise n'(t)
200c block diagram showing an example of a fast camera-enhanced noise reduction and speech activity recognition system for a telephony-based application implementing an audio-visual speech activity estimation algorithm according to a further embodiment of the present invention, wherein the discrete signal spectrum S(k·Δf) of the analog-to-digital-converted audio signal s(t) is correlated (S11b) with an audio-visual speech activity estimate, obtained by combining an audio feature vector oa,t which is supplied by said audio feature extraction and analyzing means 106b with a visual feature vector ov,t supplied by the visual speech activity detection module 104", thereby yielding an estimate Ŝi(f) for the corresponding frequency spectrum Si(f) of the signal si(t) which represents said speaker's voice as well as an estimate Φ̂nn(f) for the noise power density spectrum Φnn(f) of the statistically distributed background noise n'(t)
202 delay element, providing a delayed version of the discrete signal spectrum S(k·Δf) of the analog-to-digital-converted audio signal s(t)
204 band-pass filter with cut-off frequencies which can be adjusted dependent on the bandwidth of the estimated speech signal spectrum Ŝi(f), used for filtering the discrete signal spectrum S(k·Δf) of the analog-to-digital-converted audio signal s(t)
300a flow chart illustrating a near-end speaker detection method reducing the noise level of a detected analog audio sequence s(t) according to the embodiment depicted in Fig. 1 of the present invention
300b flow chart illustrating a near-end speaker detection method according to the embodiment depicted in Fig. 2b of the present invention
300c flow chart illustrating a near-end speaker detection method according to the embodiment depicted in Fig. 2c of the present invention
SW means for switching said microphone 101a off when the actual level of the speech activity indication signal si(nT) falls below a predefined threshold value (not shown)
S1 step #1: subjecting the analog audio sequence s(t) to an analog-to-digital conversion
S10 step #10: adjusting the cut-off frequencies of the band-pass filter 204 used for filtering the discrete signal spectrum S(k·Δf) of the analog-to-digital-converted audio signal s(t) dependent on the bandwidth of the estimated speech signal spectrum Ŝi(f)
S11a step #11a: adding an audio speech activity estimate which is obtained by an amplitude detection of the band-pass-filtered discrete signal spectrum S(k·Δf) of the analog-to-digital-converted audio signal s(t) to a visual speech activity estimate taken from a visual feature vector ov,t supplied by the visual feature extraction and analyzing means 104a+b and/or 104'+104", thereby yielding an audio-visual speech activity estimate
S11b step #11b: correlating the discrete signal spectrum S(k·Δf) with the audio-visual speech activity estimate, thereby yielding an estimate Ŝi(f) for the frequency spectrum Si(f) corresponding to the signal si(t) which represents said speaker's voice as well as an estimate Φ̂nn(f) for said noise power density spectrum Φnn(f)
S11c step #11c: adjusting the cut-off frequencies of the band-pass filter 204 used for filtering the discrete signal spectrum S(k·Δf) of the analog-to-digital-converted audio signal s(t) dependent on the bandwidth of the estimated speech signal spectrum Ŝi(f)
S9 step #9: correlating the discrete signal spectrum Sτ(k·Δf) of a delayed version s(nT−τ) of the analog-to-digital-converted audio signal s(nT) with a visual speech activity estimate taken from a visual feature vector ov,t supplied by the visual feature extraction and analyzing means 104a+b and/or 104'+104", thereby yielding a further estimate Ŝi′(f) for updating the estimate Ŝi(f) for the frequency spectrum Si(f) corresponding to the signal si(t) which represents said speaker's voice as well as a further estimate Φ̂nn′(f) for updating the estimate Φ̂nn(f) for the noise power density spectrum Φnn(f) of the statistically distributed background noise n'(t)
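Once the estimates Ŝi(f) and Φ̂nn(f) are available, the actual noise reduction can be carried out in the spectral domain. The following is a generic sketch of one conventional way to use a noise power density estimate, a Wiener-type spectral gain; it is illustrative only and is not the specific noise reduction circuit 106 described above.

```python
import numpy as np

def spectral_gain_noise_reduction(S, noise_psd, floor=1e-12):
    """Suppress background noise n'(t) in the spectral domain, given the
    signal spectrum S(k*df) and an estimate of the noise power density
    spectrum Phi_nn(f). Wiener-style gain; the flooring constant is an
    assumption to avoid division by zero."""
    psd = np.abs(S) ** 2
    # Gain per frequency bin: fraction of the power attributed to speech,
    # clipped to [0, 1] by the maximum with zero in the numerator
    gain = np.maximum(psd - noise_psd, 0.0) / np.maximum(psd, floor)
    return gain * S
```

Bins whose power does not exceed the estimated noise floor are zeroed, while bins dominated by the speaker's voice pass nearly unchanged, which is the intended separation of si(t) from n'(t).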